Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

2026-03-10 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

A 2026 position paper argues that current coding benchmarks are misaligned with agentic software engineering, which increasingly relies on complex "system harnesses" rather than standalone large language models (LLMs). Benchmarks such as SWE-Bench, HumanEval, MBPP, LiveCodeBench, and BigCodeBench were designed for a pre-agent era: they conflate the LLM with the broader system harness, grade against single reference solutions, and provide no component-level signal. For instance, Claude Opus 4.6's success rate on TerminalBench can vary by more than 20 percentage points across agent harnesses. Real-world acceptance rates for agent-authored pull requests (35-64%) lag well behind benchmark figures (>70% on SWE-Bench Verified). The paper, presented at the Agentic Software Engineering (SE 3.0) Workshop, advocates structural changes to evaluation methods.

Key takeaway

AI engineers and ML scientists evaluating coding agents should treat a single end-to-end benchmark score as insufficient and often misleading. An agent's true performance is a property of its entire "system harness" (the model, tools, environment, and feedback), not just the underlying LLM. Demand and implement evaluations that include detailed harness metadata, multi-shape behavioral verifiers, and component-level metrics. These let you assess and improve agent capabilities accurately, and avoid attributing results to the wrong component or selecting a suboptimal system.

Key insights

Current coding benchmarks fail to evaluate agentic software engineering accurately because they conflate the model with its harness, grade against a single reference solution, and provide no component-level signal.

Principles

Agentic systems are composite harnesses, not just models.
Benchmarks implicitly shape research directions.
Evaluation requires distinguishing the construct (the capability you want to measure) from its operationalization (the benchmark that measures it).

Method

Benchmark stewards should require harness-aware metadata, move from single-reference test sets to multi-shape behavioral verifiers, and develop component-level evaluation methods.
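
As a rough illustration of what harness-aware metadata could make possible, the sketch below tags each benchmark run with its harness configuration and then reports, per model, how far the pass rate moves across harnesses. The `RunRecord` fields, the `harness_spread` helper, and all numbers are hypothetical and chosen for illustration; the paper does not prescribe this schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One benchmark run, tagged with hypothetical harness-aware metadata."""
    model: str       # underlying LLM identifier
    harness: str     # agent scaffold / system harness name
    tools: tuple     # tools exposed to the agent
    max_steps: int   # step budget given to the agent
    passed: int      # tasks solved
    total: int       # tasks attempted

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def harness_spread(records):
    """Per model: max minus min pass rate across its runs, in percentage points."""
    by_model = defaultdict(list)
    for r in records:
        by_model[r.model].append(r.pass_rate)
    return {m: round(100 * (max(v) - min(v)), 1) for m, v in by_model.items()}


# Made-up numbers for illustration only; these are not results from the paper.
records = [
    RunRecord("model-a", "harness-x", ("shell",), 50, 60, 100),
    RunRecord("model-a", "harness-y", ("shell", "editor"), 50, 75, 100),
    RunRecord("model-b", "harness-x", ("shell",), 50, 40, 100),
]
print(harness_spread(records))  # {'model-a': 15.0, 'model-b': 0.0}
```

A spread of zero for a model evaluated under a single harness says nothing about its sensitivity to the harness; it only shows that the metadata needed to measure that sensitivity was never collected.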

In practice

Ablate non-model axes (tools, environment, feedback) in agent evaluations while holding the model fixed.
Use property tests or differential tests for grading.
Evaluate agent components in isolation.
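
The second item above can be sketched as a small grader that accepts any behaviorally correct solution instead of comparing against one reference patch. This is a minimal sketch under stated assumptions: the sorting task, the `reference_sort` oracle, and the two stand-in submissions are hypothetical and serve only to show the shape of differential and property-based grading.

```python
import random


def reference_sort(xs):
    """Trusted oracle used for the differential check."""
    return sorted(xs)


def candidate_sort(xs):
    """Stand-in for agent-written code: an insertion sort, unlike the reference in form."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def buggy_sort(xs):
    """Stand-in for a wrong submission: silently drops duplicates."""
    return sorted(set(xs))


def grade(candidate, trials=200, seed=0):
    """Return True if candidate passes both checks on randomly generated inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        got = candidate(list(xs))
        # Differential check: same observable behavior as the oracle.
        if got != reference_sort(xs):
            return False
        # Property checks: output is ordered and keeps the input length.
        if any(a > b for a, b in zip(got, got[1:])) or len(got) != len(xs):
            return False
    return True


print(grade(candidate_sort), grade(buggy_sort))  # True False
```

The property checks are redundant here because an oracle exists, but they are the part that still works when no trusted reference implementation is available.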

Topics

Coding Agents
Software Engineering
LLM Benchmarking
System Harnesses
Agentic AI
Evaluation Metrics

Best for: Research Scientist, AI Architect, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.