To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair
Summary
An empirical study analyzed the cost-effectiveness of code execution in LLM-based program repair, examining 7,745 agent traces from SWE-bench leaderboard submissions and 3,000 controlled repair attempts across 200 SWE-bench instances. Using agents like Claude Code, Codex, and open-source OpenCode with Qwen2.5-Coder-32B, the study evaluated four execution paradigms. Findings indicate agents average 8.8 test runs per task, with late-stage executions achieving 57.9% success. Crucially, execution restrictions had minimal impact on repair success: the resolve-rate gap between "Prohibited" and "Unrestricted" modes was only 1.25pp for commercial agents (not statistically significant, p>0.05) and ≈ 0pp for OpenCode. "Prohibited" mode saved 56–62% tokens and 48–54% wall-clock time for Claude Code, suggesting execution is often applied indiscriminately.
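To make the trade-off concrete, here is a minimal sketch that turns the reported figures into a relative token cost per resolved task. The function name and the 50% baseline resolve rate are assumptions for illustration and do not come from the study. The sketch also combines the 1.25pp gap reported for commercial agents with the 56–62% token savings reported for Claude Code.

```python
def relative_cost_per_resolved(baseline_resolve_rate: float,
                               resolve_gap_pp: float,
                               token_saving: float) -> float:
    """Token cost per resolved task without execution, relative to unrestricted.

    A value below 1.0 means the restricted mode is cheaper per resolved task.
    baseline_resolve_rate: unrestricted resolve rate as a fraction (assumed input).
    resolve_gap_pp: resolve-rate drop in percentage points when execution is prohibited.
    token_saving: fraction of tokens saved when execution is prohibited.
    """
    restricted_rate = baseline_resolve_rate - resolve_gap_pp / 100.0
    if baseline_resolve_rate <= 0 or restricted_rate <= 0:
        raise ValueError("resolve rates must be positive")
    # Normalize unrestricted token usage per attempt to 1.0.
    unrestricted_cost = 1.0 / baseline_resolve_rate
    restricted_cost = (1.0 - token_saving) / restricted_rate
    return restricted_cost / unrestricted_cost


# Hypothetical 50% baseline; gap and savings range taken from the summary above.
low_saving = relative_cost_per_resolved(0.50, 1.25, 0.56)
high_saving = relative_cost_per_resolved(0.50, 1.25, 0.62)
print(f"relative cost per resolved task: {high_saving:.2f} to {low_saving:.2f}")
```

Under these assumptions, a resolved task without execution costs well under half the tokens of one with unrestricted execution. The result depends on the assumed baseline, so substitute a measured resolve rate before drawing conclusions.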
Key takeaway
For ML engineers designing or deploying LLM-based program repair agents, recognize that unrestricted code execution often provides minimal marginal benefit while incurring substantial token and wall-clock costs. Prioritize adaptive execution strategies that selectively engage testing only when high-value information gain is expected, rather than treating it as a default capability. This approach can significantly reduce operational expenses and environment setup overhead without compromising repair effectiveness.
Key insights
Code execution in LLM-based program repair often adds little to repair success despite significant token and wall-clock costs, which suggests agents apply it indiscriminately.
Principles
- Code execution is a resource with explicit cost-benefit.
- Late-stage executions yield higher success rates.
- Partial execution access can be counterproductive.
Method
A two-stage empirical study analyzed 7,745 agent traces and conducted 3,000 controlled repair attempts across 200 SWE-bench instances under varying execution access paradigms.
In practice
- Restrict execution for cost-sensitive LLM agents.
- Adopt execution selectively, based on clear value.
- Focus on enhancing execution feedback quality.
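One way to act on the recommendations above is to gate test runs behind an explicit budget. The sketch below is a hypothetical policy, not the study's method: the class name, the default budget of two runs, and the "candidate final patch" signal are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecutionPolicy:
    """Hypothetical gate that treats each test run as a budgeted resource."""
    max_runs: int = 2              # assumed budget, not a value from the study
    min_edits_before_run: int = 1  # skip runs when nothing changed since the last one
    runs_used: int = 0

    def should_run(self, edits_since_last_run: int, patch_is_candidate_final: bool) -> bool:
        """Allow a run only late in the task, after new edits, and within budget."""
        if self.runs_used >= self.max_runs:
            return False
        if edits_since_last_run < self.min_edits_before_run:
            return False
        # Favor late-stage runs: execute only once the agent believes the patch is complete.
        return patch_is_candidate_final

    def record_run(self) -> None:
        """Count a run against the budget after it has been executed."""
        self.runs_used += 1


policy = ExecutionPolicy(max_runs=1)
if policy.should_run(edits_since_last_run=3, patch_is_candidate_final=True):
    policy.record_run()  # the agent would invoke its test command here
print(policy.should_run(edits_since_last_run=1, patch_is_candidate_final=True))
```

Setting `max_runs=0` reproduces a fully prohibited mode, while a large budget approaches unrestricted behavior, so the same gate can be used to compare settings on a given workload.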
Topics
- LLM Agents
- Program Repair
- Code Execution
- Cost-Effectiveness
- SWE-bench
- Resource Management
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.