WildIFEval: Instruction Following in the Wild

2025-02-14 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

WildIFEval is a new large-scale dataset comprising 12K real user instructions with diverse, multi-constraint conditions, collected from the Chatbot Arena. Developed by Gili Lior et al. from The Hebrew University of Jerusalem and IBM Research, this benchmark categorizes constraints into eight high-level classes. Extensive experiments benchmarking 14 leading LLMs, including Deepseek-v3, Mistral-Large-instruct-2407, and various Llama3.x models, revealed that all evaluated models experience performance degradation with an increasing number of constraints. The best model achieved a score of only 0.65, indicating substantial room for improvement. Furthermore, the specific type of constraint plays a critical role, with length-related constraints proving particularly challenging for LLMs.

Key takeaway

If you are an AI scientist or machine learning engineer developing LLMs, prioritize instruction following, particularly for tasks that combine multiple, diverse constraints. Current models lose significant performance as the number of constraints grows and struggle notably with length-related requirements. Consider fine-tuning on datasets like WildIFEval to address these weaknesses and improve real-world applicability beyond simpler instruction sets.

Key insights

LLMs struggle with multi-constraint instructions, especially those that include length constraints, which highlights the need for better instruction following.

Principles

LLM performance degrades with more constraints.
Constraint type significantly impacts model performance.
Length constraints are particularly challenging for LLMs.
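
The first principle can be checked on any set of per-constraint judgments by grouping tasks by how many constraints they carry. The following is a minimal sketch, not the paper's evaluation code: it assumes each task's result is a list of booleans (one per constraint, as produced by some judge) and that a task's score is the fraction of constraints satisfied. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def score_by_constraint_count(results):
    """Average per-task score, grouped by number of constraints.

    `results` is a list of tasks; each task is a list of booleans,
    one per constraint (True = satisfied). Assumption: a task's score
    is the fraction of its constraints that were satisfied.
    """
    buckets = defaultdict(list)
    for verdicts in results:
        if not verdicts:
            continue  # skip tasks with no constraints
        buckets[len(verdicts)].append(sum(verdicts) / len(verdicts))
    return {n: mean(scores) for n, scores in sorted(buckets.items())}


# Toy judgments, made up for illustration only.
toy = [
    [True],
    [True, False],
    [True, True],
    [True, False, False],
]
trend = score_by_constraint_count(toy)
```

A downward trend in the returned mapping as the key grows would reflect the degradation described above.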

Method

WildIFEval is curated in three steps: Chatbot Arena data is filtered with Llama3.1-405b, each retained task is decomposed into its constraints with Llama3.1-70b, and the constraints are classified into 8 types.
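
The filter, decompose, and classify flow can be sketched as a small pipeline. This is an illustrative outline only: the three callables stand in for the LLM calls (the stubs below are trivial keyword rules, not the models named above), and `CuratedTask`, `curate`, and the type labels are hypothetical names, not the dataset's actual schema or taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CuratedTask:
    instruction: str
    constraints: List[str]
    constraint_types: List[str]


def curate(
    prompts: Iterable[str],
    is_constrained_task: Callable[[str], bool],  # stand-in for the filtering model
    decompose: Callable[[str], List[str]],       # stand-in for the decomposition model
    classify: Callable[[str], str],              # stand-in for the type classifier
) -> List[CuratedTask]:
    """Filter prompts, split each into constraints, and label each constraint."""
    curated = []
    for prompt in prompts:
        if not is_constrained_task(prompt):
            continue
        constraints = decompose(prompt)
        if not constraints:
            continue
        types = [classify(c) for c in constraints]
        curated.append(CuratedTask(prompt, constraints, types))
    return curated


# Toy stubs: ';' separates the task from its constraints (an assumption
# made purely so the example runs without any model).
tasks = curate(
    ["Write a poem about rain; under 50 words; in a formal tone", "hello"],
    is_constrained_task=lambda p: ";" in p,
    decompose=lambda p: [part.strip() for part in p.split(";")[1:] if part.strip()],
    classify=lambda c: "length" if "words" in c else "other",
)
```

Keeping the three stages as injected callables makes it easy to swap the stubs for real model calls without changing the pipeline shape.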

In practice

Utilize WildIFEval to benchmark LLMs on complex, real-world instructions.
Analyze specific constraint types to identify model weaknesses.
Focus on improving LLM adherence to length-related constraints.
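
To find which constraint types a model handles worst, per-constraint judgments can be aggregated by type and ranked. This is a rough sketch under stated assumptions: judgments arrive as (type, satisfied) pairs from some judge, and the type labels and data below are invented for illustration rather than taken from the benchmark.

```python
from collections import defaultdict


def weakest_constraint_types(judgments):
    """Rank constraint types from lowest to highest satisfaction rate.

    `judgments` is an iterable of (constraint_type, satisfied) pairs.
    Returns a list of (constraint_type, rate) sorted ascending by rate.
    """
    totals = defaultdict(lambda: [0, 0])  # type -> [satisfied, seen]
    for ctype, ok in judgments:
        totals[ctype][0] += int(ok)
        totals[ctype][1] += 1
    rates = {t: sat / seen for t, (sat, seen) in totals.items()}
    return sorted(rates.items(), key=lambda kv: kv[1])


# Hypothetical judgments for one model.
toy_judgments = [
    ("length", False), ("length", True), ("length", False),
    ("format", True), ("format", True),
    ("style", True), ("style", False),
]
ranking = weakest_constraint_types(toy_judgments)
```

The first entries of the ranking point to the constraint types worth targeting first, for example with additional fine-tuning data.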

Topics

Instruction Following
LLM Benchmarking
Multi-Constraint Generation
Dataset Curation
Constraint Taxonomy
Natural Language Generation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.