Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
Summary
A 2026 position paper argues that current coding benchmarks are misaligned with agentic software engineering, which increasingly relies on complex "system harnesses" rather than standalone large language models (LLMs). Benchmarks such as SWE-Bench, HumanEval, MBPP, LiveCodeBench, and BigCodeBench were designed for a pre-agent era. They conflate the LLM with the broader system harness, grade against single reference solutions, and provide no component-level signal. For instance, Claude Opus 4.6's success rate on TerminalBench can vary by more than 20 percentage points across agent harnesses. Real-world acceptance rates for agent-authored pull requests (35-64%) also lag well behind benchmark figures (>70% on SWE-Bench Verified). The paper, presented at the Agentic Software Engineering (SE 3.0) Workshop, advocates structural changes to evaluation methods.
Key takeaway
For AI Engineers and ML Scientists evaluating coding agents: a single end-to-end benchmark score is often insufficient and can mislead. An agent's true performance is a property of its entire "system harness" (the model, tools, environment, and feedback), not just the underlying LLM. Demand and implement evaluations that include detailed harness metadata, multi-shape behavioral verifiers, and component-level metrics, so that gains are not misattributed to the model and system choices are not made on misleading scores.
Key insights
Current coding benchmarks fail to accurately evaluate agentic software engineering because they conflate the model with its harness, grade against a single reference solution, and provide no component-level signal.
Principles
- Agentic systems are composite harnesses, not just models.
- Benchmarks implicitly shape research directions.
- Evaluation requires distinguishing construct from operationalization.
Method
Benchmark stewards should require harness-aware metadata, move from single-reference test sets to multi-shape behavioral verifiers, and develop component-level evaluation methods.
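As a rough illustration of harness-aware metadata, the sketch below tags each benchmark run with the harness it ran under and reports how far one model's pass rate moves across harnesses. The `HarnessRun` schema, its field names, and the `harness_spread` helper are hypothetical, and the numbers are made up for the example; none of them come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessRun:
    """One benchmark run plus the harness it ran under (hypothetical schema)."""
    model: str              # underlying LLM
    harness: str            # agent scaffold / tool loop
    tools: tuple[str, ...]  # tools exposed to the agent
    max_turns: int          # interaction budget
    pass_rate: float        # fraction of tasks solved, in [0, 1]


def harness_spread(runs: list[HarnessRun]) -> dict[str, float]:
    """Per model: max minus min pass rate across harnesses, in percentage points."""
    by_model: dict[str, list[float]] = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run.pass_rate)
    return {m: round((max(r) - min(r)) * 100, 1) for m, r in by_model.items()}


# Illustrative numbers only, not results from the paper.
runs = [
    HarnessRun("model-a", "harness-x", ("shell", "editor"), 50, 0.62),
    HarnessRun("model-a", "harness-y", ("shell",), 30, 0.41),
    HarnessRun("model-b", "harness-x", ("shell", "editor"), 50, 0.55),
]
print(harness_spread(runs))
```

A large spread for a single model suggests that the reported score reflects the harness at least as much as the LLM, which is only visible when the metadata is recorded alongside the score.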
In practice
- Ablate non-model axes in agent evaluations.
- Use property tests or differential tests for grading.
- Evaluate agent components in isolation.
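One way to grade on behavior rather than on a single reference patch is to combine a differential test against a trusted oracle with property checks. The sketch below assumes a toy task (order-preserving deduplication); the task, the function names, and the `grade` helper are illustrative assumptions, not the paper's verifier design.

```python
import random
from typing import Callable

Dedupe = Callable[[list[int]], list[int]]


def reference_dedupe(xs: list[int]) -> list[int]:
    """Trusted oracle: keep the first occurrence of each element."""
    seen: set[int] = set()
    out: list[int] = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def grade(candidate: Dedupe, trials: int = 200, seed: int = 0) -> bool:
    """Accept any implementation whose behavior matches, whatever its shape."""
    rng = random.Random(seed)
    cases = [[], [1, 1], [2, 1, 2]]  # fixed edge cases first
    cases += [[rng.randint(0, 9) for _ in range(rng.randint(0, 12))]
              for _ in range(trials)]
    for xs in cases:
        try:
            out = candidate(list(xs))
        except Exception:
            return False
        # Property checks: no duplicates, and no elements lost or invented.
        if len(out) != len(set(out)) or set(out) != set(xs):
            return False
        # Differential check: same output as the oracle on the same input.
        if out != reference_dedupe(list(xs)):
            return False
    return True


def dict_based(xs: list[int]) -> list[int]:
    return list(dict.fromkeys(xs))  # different code shape, same behavior


def sorted_set(xs: list[int]) -> list[int]:
    return sorted(set(xs))  # loses the original order


print(grade(dict_based), grade(sorted_set))
```

Here `dict_based` looks nothing like the oracle but is accepted, while `sorted_set` satisfies the property checks yet is rejected by the differential check on `[2, 1, 2]`. The same pattern can be pointed at a single agent component, for example a patch generator fed a fixed, known-good localization, to evaluate it in isolation.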
Topics
- Coding Agents
- Software Engineering
- LLM Benchmarking
- System Harnesses
- Agentic AI
- Evaluation Metrics
Best for: Research Scientist, AI Architect, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.