SWE-IF: Aligning Code Evaluation with Human Preference

2025-10-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, short

Summary

The paper introduces SWE-IF, a testbed that evaluates the code generation of Large Language Models (LLMs) beyond functional correctness. It targets "vibe coding," where human preference also depends on non-functional aspects such as code cleanliness and intent preservation. The authors argue that instruction following is what lets a model pass these human "vibe checks." To quantify it, they developed VeriCode, a taxonomy of 30 verifiable code instructions, each paired with a deterministic verifier. SWE-IF augments existing evaluation suites with VeriCode so that it measures both instruction following and functional correctness. An evaluation of 31 LLMs showed that even top models struggle to follow multiple instructions at once and can regress functionally when given them. A composite score combining functional correctness and instruction following correlated most strongly with human preference, and instruction following was the primary factor separating models. The code, data, and taxonomy are publicly available.

Key takeaway

Machine learning engineers developing or deploying code-generating LLMs should prioritize instruction following alongside functional correctness. Functional metrics such as pass@k are not sufficient on their own: models must also honor non-functional requirements to pass human "vibe checks." Evaluation tools such as SWE-IF, which combines functional and instruction-following assessments, give a more accurate benchmark for improving these models. Because the composite score tracked human preference most closely in the paper's evaluation, optimizing for it should produce models that better match user preferences and may reduce iterative refinement cycles.

Key insights

Capturing human preference in code generation requires evaluating instruction following together with functional correctness, not functional correctness alone.

Principles

Instruction following differentiates LLM code generation quality.
Non-functional code attributes are critical for human satisfaction.
Verifiable instructions enable objective "vibe check" evaluation.

Method

Develop a taxonomy of 30 verifiable code instructions (VeriCode) with deterministic verifiers. Augment existing evaluation suites to create a testbed (SWE-IF) assessing both instruction following and functional correctness.
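To make "verifiable instruction with a deterministic verifier" concrete, here is a minimal Python sketch. The two instructions shown (a line-length limit and mandatory docstrings), the function names, and the VERIFIERS registry are hypothetical illustrations and are not taken from the VeriCode taxonomy or the SWE-IF codebase. The point is only that each instruction maps to a check that returns the same pass/fail answer for the same code every time.

```python
import ast
from typing import Callable, Dict, List


def max_line_length(code: str, limit: int = 79) -> bool:
    """Pass if no line in the code exceeds `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def all_functions_have_docstrings(code: str) -> bool:
    """Pass if every function definition in the code has a docstring."""
    tree = ast.parse(code)
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


# Hypothetical registry: instruction name -> deterministic verifier.
VERIFIERS: Dict[str, Callable[[str], bool]] = {
    "line_length": max_line_length,
    "docstrings": all_functions_have_docstrings,
}


def check_instructions(code: str, names: List[str]) -> Dict[str, bool]:
    """Run the named verifiers on generated code and report each result."""
    return {name: VERIFIERS[name](code) for name in names}
```

Calling check_instructions(generated_code, ["docstrings", "line_length"]) returns a per-instruction pass/fail mapping, which can be recorded next to the usual test-suite result for the same sample.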

In practice

Integrate VeriCode into LLM code generation pipelines.
Prioritize instruction following in LLM fine-tuning.
Use SWE-IF to benchmark code LLMs.
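As a rough illustration of benchmarking on both axes, the Python sketch below aggregates per-sample results into a functional rate, an instruction-following rate, and a composite rate. This summary does not state how the paper defines its composite score, so the definition used here (a sample counts only if its tests pass and every instruction is followed) is an assumption, and the Result type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """Outcome for one generated sample (hypothetical record format)."""
    tests_passed: bool                 # functional correctness
    instructions_followed: List[bool]  # one entry per instruction given


def functional_rate(results: List[Result]) -> float:
    """Fraction of samples whose tests pass."""
    if not results:
        return 0.0
    return sum(r.tests_passed for r in results) / len(results)


def instruction_rate(results: List[Result]) -> float:
    """Fraction of samples that follow every instruction they were given."""
    if not results:
        return 0.0
    return sum(all(r.instructions_followed) for r in results) / len(results)


def composite_rate(results: List[Result]) -> float:
    """Assumed composite: tests pass AND all instructions are followed."""
    if not results:
        return 0.0
    ok = sum(r.tests_passed and all(r.instructions_followed) for r in results)
    return ok / len(results)
```

Reporting the three numbers side by side makes it visible when a model keeps its functional rate but loses ground once instructions are added, or the reverse.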

Topics

SWE-IF
Code Generation
LLM Evaluation
Instruction Following
Human Preference
VeriCode

Code references

maszhongming/SWE-IF

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.