SWE-IF: Aligning Code Evaluation with Human Preference

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, short

Summary

The paper introduces SWE-IF, a testbed for evaluating the code generation capabilities of Large Language Models (LLMs) beyond functional correctness. It targets "vibe coding," where human preference also depends on non-functional aspects such as code cleanliness and intent preservation. The authors posit that instruction following is crucial for passing human "vibe checks." To quantify it, they developed VeriCode, a taxonomy of 30 verifiable code instructions, each paired with a deterministic verifier. SWE-IF augments existing evaluation suites with VeriCode so that it assesses both instruction following and functional correctness. An evaluation of 31 LLMs showed that even top models struggle when given multiple instructions and can exhibit functional regression. A composite score combining functional correctness and instruction following showed the strongest correlation with human preference, and instruction following was the primary factor distinguishing LLM performance. The code, data, and taxonomy are publicly available.

Key takeaway

If you are a Machine Learning Engineer developing or deploying code-generating LLMs, prioritize instruction following alongside functional correctness. Functional metrics such as pass@k are insufficient on their own: your models must also adhere to non-functional requirements to pass human "vibe checks." Integrate evaluation tools like SWE-IF, which combines functional and instruction-following assessments, to benchmark and improve your LLMs more accurately. Focusing on the composite score should yield models that align better with user preferences and may reduce iterative refinement cycles.

Key insights

Capturing human preference in code generation requires evaluating both functional correctness and instruction following, not functionality alone.
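
To make the idea of a composite score concrete, here is a minimal sketch. The summary does not state how the paper combines the two signals, so the weighted average below, the `weight` parameter, and the `TaskResult` record are all assumptions chosen for illustration, not the authors' formula.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative."""
    passed_tests: bool          # functional correctness on this task
    instructions_followed: int  # number of instructions the verifiers accepted
    instructions_total: int     # number of instructions attached to the task


def composite_score(results: Sequence[TaskResult], weight: float = 0.5) -> float:
    """Weighted average of functional pass rate and instruction-following rate.

    ASSUMPTION: a linear combination is used only as an example; the paper's
    actual composite may be defined differently.
    """
    if not results:
        raise ValueError("results must not be empty")
    functional = sum(r.passed_tests for r in results) / len(results)
    following = sum(
        r.instructions_followed / r.instructions_total for r in results
    ) / len(results)
    return weight * functional + (1.0 - weight) * following


results = [
    TaskResult(passed_tests=True, instructions_followed=2, instructions_total=2),
    TaskResult(passed_tests=True, instructions_followed=0, instructions_total=2),
    TaskResult(passed_tests=False, instructions_followed=1, instructions_total=2),
]
score = composite_score(results)
```

In this toy run, two of three tasks pass their tests while only half of the instructions are followed on average, so a model that looks reasonable on functionality alone scores lower once instruction following is counted.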

Method

Develop a taxonomy of 30 verifiable code instructions (VeriCode) with deterministic verifiers. Augment existing evaluation suites to create a testbed (SWE-IF) assessing both instruction following and functional correctness.
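
A deterministic verifier can be as simple as a pure function from source code to a boolean. The sketch below is not taken from VeriCode: the two instructions ("every function has a docstring" and "no line exceeds a length limit") and all function names are hypothetical examples of what a verifiable code instruction might look like, checked with Python's standard `ast` module.

```python
import ast
from typing import Callable, Sequence

Verifier = Callable[[str], bool]


def all_functions_have_docstrings(source: str) -> bool:
    """Hypothetical instruction: every function definition has a docstring."""
    tree = ast.parse(source)
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def max_line_length(limit: int) -> Verifier:
    """Hypothetical instruction: no source line is longer than `limit` characters."""
    def verify(source: str) -> bool:
        return all(len(line) <= limit for line in source.splitlines())
    return verify


def instruction_following_rate(source: str, verifiers: Sequence[Verifier]) -> float:
    """Fraction of the attached instructions that the generated code satisfies."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    return sum(v(source) for v in verifiers) / len(verifiers)


generated = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
rate = instruction_following_rate(
    generated,
    [all_functions_have_docstrings, max_line_length(20)],
)
```

Because each check is a deterministic function of the generated text, the same output always receives the same verdict, which is what lets instruction following be scored automatically alongside the existing functional tests.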

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.