SWE-IF: Aligning Code Evaluation with Human Preference
Summary
The paper introduces SWE-IF, a testbed for evaluating the code generation capabilities of large language models (LLMs) beyond functional correctness. It targets "vibe coding," where human preference also covers non-functional aspects such as code cleanliness and intent preservation. The authors argue that instruction following is what lets generated code pass a human "vibe check." To quantify it, they developed VeriCode, a taxonomy of 30 verifiable code instructions, each paired with a deterministic verifier. By augmenting existing evaluation suites with VeriCode, SWE-IF assesses both instruction following and functional correctness. An evaluation of 31 LLMs showed that even top models struggle to follow multiple instructions at once and can regress functionally when instructions are added. A composite score combining functional correctness and instruction following correlated most strongly with human preference, and instruction following was the primary factor distinguishing LLM performance. The code, data, and taxonomy are publicly available.
Key takeaway
If you are a machine learning engineer developing or deploying code-generating LLMs, prioritize instruction following alongside functional correctness. pass@k metrics alone are insufficient: your models must also adhere to non-functional requirements to pass human "vibe checks." Use an evaluation tool such as SWE-IF, which combines functional and instruction-following assessments, to benchmark and improve your LLMs. Optimizing for this composite score should yield models that align better with user preferences and need fewer rounds of iterative refinement.
Key insights
Human preference in code generation requires evaluating both functional correctness and instruction following, not just functionality.
Principles
- Instruction following differentiates LLM code generation quality.
- Non-functional code attributes are critical for human satisfaction.
- Verifiable instructions enable objective "vibe check" evaluation.
Method
Develop a taxonomy of 30 verifiable code instructions (VeriCode) with deterministic verifiers. Augment existing evaluation suites to create a testbed (SWE-IF) assessing both instruction following and functional correctness.
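To make "verifiable instruction with a deterministic verifier" concrete, here is a minimal Python sketch. It is not the VeriCode implementation: the two instructions, the function names, the `VERIFIERS` registry, and the default line limit are all hypothetical, chosen only to illustrate a check that returns the same pass/fail answer for the same code every time.

```python
import ast
from typing import Callable, Dict, List


def max_line_length_ok(code: str, limit: int = 79) -> bool:
    """Hypothetical instruction: no line is longer than `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def all_functions_have_docstrings(code: str) -> bool:
    """Hypothetical instruction: every function definition has a docstring."""
    tree = ast.parse(code)
    functions = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(fn) is not None for fn in functions)


# Illustrative registry mapping an instruction id to its verifier.
VERIFIERS: Dict[str, Callable[[str], bool]] = {
    "line_length": max_line_length_ok,
    "docstrings": all_functions_have_docstrings,
}


def instruction_following_rate(code: str, instruction_ids: List[str]) -> float:
    """Fraction of the requested instructions that the code satisfies."""
    if not instruction_ids:
        raise ValueError("at least one instruction id is required")
    passed = [VERIFIERS[name](code) for name in instruction_ids]
    return sum(passed) / len(passed)
```

Because each verifier is a pure function of the generated source, the same checks can be run alongside a suite's existing unit tests without a human or model judge in the loop.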
In practice
- Integrate VeriCode into LLM code generation pipelines.
- Prioritize instruction following in LLM fine-tuning.
- Use SWE-IF to benchmark code LLMs.
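When benchmarking along these lines, the two signals have to be combined somehow. The summary does not state how the composite score is computed, so the sketch below assumes a simple weighted average; the weight `alpha`, the `ModelResult` type, and the function names are hypothetical, and the numbers in any usage are placeholders rather than reported results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelResult:
    name: str
    functional: float   # fraction of tasks passing unit tests, in [0, 1]
    instruction: float  # fraction of instructions satisfied, in [0, 1]


def composite_score(result: ModelResult, alpha: float = 0.5) -> float:
    """Weighted mix of functional correctness and instruction following.

    The weighted average is an assumption of this sketch, not the
    paper's formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * result.functional + (1.0 - alpha) * result.instruction


def rank_models(results: List[ModelResult], alpha: float = 0.5) -> List[ModelResult]:
    """Sort models from highest to lowest composite score."""
    return sorted(results, key=lambda r: composite_score(r, alpha), reverse=True)
```

Setting `alpha=1.0` reduces the ranking to functional correctness alone, which makes it easy to see how much the ordering of models changes once instruction following is taken into account.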
Topics
- SWE-IF
- Code Generation
- LLM Evaluation
- Instruction Following
- Human Preference
- VeriCode
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.