WildIFEval: Instruction Following in the Wild
Summary
WildIFEval is a new large-scale dataset comprising 12K real user instructions with diverse, multi-constraint conditions, collected from the Chatbot Arena. Developed by Gili Lior et al. from The Hebrew University of Jerusalem and IBM Research, this benchmark categorizes constraints into eight high-level classes. Extensive experiments benchmarking 14 leading LLMs, including Deepseek-v3, Mistral-Large-instruct-2407, and various Llama3.x models, revealed that all evaluated models experience performance degradation with an increasing number of constraints. The best model achieved a score of only 0.65, indicating substantial room for improvement. Furthermore, the specific type of constraint plays a critical role, with length-related constraints proving particularly challenging for LLMs.
Key takeaway
AI scientists and machine learning engineers developing LLMs should prioritize instruction-following capabilities, particularly for tasks that combine multiple, diverse constraints. The evaluated models show significant performance drops as the number of constraints grows and struggle notably with length-related requirements. Consider fine-tuning on datasets like WildIFEval to address these specific weaknesses and improve real-world applicability beyond simpler instruction sets.
Key insights
LLMs struggle with multi-constraint instructions, especially length-related ones, highlighting the need for better instruction following.
Principles
- LLM performance degrades with more constraints.
- Constraint type significantly impacts model performance.
- Length constraints are particularly challenging for LLMs.
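The first principle can be checked on your own evaluation results with a small aggregation. The sketch below is illustrative only: it assumes each task is scored as the fraction of its constraints that were satisfied, and that per-constraint verdicts are already available as booleans (for example, from a judge model). The function name and the toy data are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def score_by_constraint_count(results):
    """Average per-task score, grouped by how many constraints the task has.

    `results` is a list with one entry per task; each entry is a list of
    booleans (True = that constraint was judged satisfied).
    Assumption: a task's score is the fraction of its constraints satisfied.
    """
    buckets = defaultdict(list)
    for verdicts in results:
        if not verdicts:  # skip tasks with no constraints
            continue
        buckets[len(verdicts)].append(sum(verdicts) / len(verdicts))
    # Mean task score for each constraint count, in ascending order.
    return {n: sum(scores) / len(scores) for n, scores in sorted(buckets.items())}


# Toy verdicts, invented for illustration only.
toy_results = [[True], [True, False], [True, True], [True, False, False]]
by_count = score_by_constraint_count(toy_results)
for n, score in by_count.items():
    print(f"{n} constraint(s): {score:.2f}")
```

A downward trend across the buckets would mirror the degradation described above.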
Method
WildIFEval is curated in three steps: Chatbot Arena data is filtered with Llama3.1-405b, each retained task is decomposed into its individual constraints with Llama3.1-70b, and the constraints are classified into eight types.
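The three curation steps can be sketched as a simple pipeline. This is a minimal illustration of the data flow, not the authors' implementation: the three callables stand in for the LLM calls, and the keyword-based stubs, the `Task` class, and the label `"other"` are hypothetical placeholders. The eight category names are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Task:
    instruction: str
    constraints: List[str] = field(default_factory=list)
    constraint_types: List[str] = field(default_factory=list)


def curate(
    raw_instructions: Iterable[str],
    keep: Callable[[str], bool],             # step 1: filter (an LLM in the paper)
    decompose: Callable[[str], List[str]],   # step 2: split into constraints
    classify: Callable[[str], str],          # step 3: assign a constraint type
) -> List[Task]:
    tasks = []
    for text in raw_instructions:
        if not keep(text):
            continue
        constraints = decompose(text)
        if not constraints:  # nothing to evaluate against
            continue
        tasks.append(Task(text, constraints, [classify(c) for c in constraints]))
    return tasks


# Hypothetical keyword stubs so the sketch runs without any model access.
def stub_keep(text: str) -> bool:
    return "write" in text.lower()


def stub_decompose(text: str) -> List[str]:
    # Treat everything after the first comma as comma-separated constraints.
    return [part.strip() for part in text.split(",")[1:] if part.strip()]


def stub_classify(constraint: str) -> str:
    return "length" if "word" in constraint.lower() else "other"


raw = ["Write a poem about rain, under 50 words, in a cheerful tone", "hi there"]
curated = curate(raw, stub_keep, stub_decompose, stub_classify)
for task in curated:
    print(task.constraints, task.constraint_types)
```

Passing the steps in as callables keeps the pipeline independent of any particular model client, so real LLM-backed functions could replace the stubs.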
In practice
- Utilize WildIFEval to benchmark LLMs on complex, real-world instructions.
- Analyze specific constraint types to identify model weaknesses.
- Focus on improving LLM adherence to length-related constraints.
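Length constraints are one of the few constraint types that can be verified programmatically. The sketch below assumes a simple "at most N words" constraint and whitespace tokenization. Both are simplifications chosen for illustration, and the function names are hypothetical, not part of WildIFEval.

```python
from typing import List


def satisfies_max_words(response: str, max_words: int) -> bool:
    """True if the response has at most `max_words` whitespace-separated words."""
    return len(response.split()) <= max_words


def length_adherence(responses: List[str], max_words: int) -> float:
    """Fraction of responses that respect the word limit."""
    if not responses:
        raise ValueError("responses must not be empty")
    satisfied = sum(satisfies_max_words(r, max_words) for r in responses)
    return satisfied / len(responses)


# Toy responses, invented for illustration only.
samples = ["Rain falls softly tonight.", "This answer keeps going well past the limit."]
rate = length_adherence(samples, max_words=5)
print(f"Adherence to a 5-word limit: {rate:.2f}")
```

Tracking this rate separately from other constraint types makes it easier to see whether length handling improves after fine-tuning.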
Topics
- Instruction Following
- LLM Benchmarking
- Multi-Constraint Generation
- Dataset Curation
- Constraint Taxonomy
- Natural Language Generation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.