To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

2026-01-23 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

An empirical study analyzed the cost-effectiveness of code execution in LLM-based program repair. It examined 7,745 agent traces from SWE-bench leaderboard submissions and 3,000 controlled repair attempts across 200 SWE-bench instances. Using agents such as Claude Code, Codex, and the open-source OpenCode with Qwen2.5-Coder-32B, the study evaluated four execution paradigms, including "Prohibited" and "Unrestricted" modes. Agents averaged 8.8 test runs per task, and late-stage executions achieved a 57.9% success rate. Crucially, restricting execution had minimal impact on repair success: the resolve-rate gap between "Prohibited" and "Unrestricted" modes was only 1.25pp for commercial agents (not statistically significant, p > 0.05) and ≈ 0pp for OpenCode. For Claude Code, "Prohibited" mode saved 56–62% of tokens and 48–54% of wall-clock time, suggesting that agents often invoke execution indiscriminately.
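One way to read the resolve-rate gap and the token savings together is as tokens spent per resolved task. The sketch below is a back-of-the-envelope calculation, not a metric taken from the study. The 50% baseline resolve rate is a placeholder assumption (the summary reports only the gap, not the absolute rates), and the function name is illustrative.

```python
def relative_cost_per_resolve(baseline_rate: float, gap_pp: float, token_saving: float) -> float:
    """Tokens per resolved task in "Prohibited" mode, relative to "Unrestricted" (= 1.0).

    baseline_rate: resolve rate in "Unrestricted" mode, as a fraction (ASSUMED, not reported).
    gap_pp:        resolve-rate drop in "Prohibited" mode, in percentage points.
    token_saving:  fraction of tokens saved in "Prohibited" mode.
    """
    prohibited_rate = baseline_rate - gap_pp / 100.0
    if not 0.0 < baseline_rate <= 1.0 or prohibited_rate <= 0.0:
        raise ValueError("resolve rates must stay within (0, 1]")
    relative_tokens = 1.0 - token_saving              # tokens spent per attempt
    relative_resolves = prohibited_rate / baseline_rate  # resolved tasks per attempt
    return relative_tokens / relative_resolves


ASSUMED_BASELINE = 0.50  # hypothetical; substitute the real resolve rate when known

# 1.25pp gap and the 56-62% token savings are the figures quoted in the summary.
for saving in (0.56, 0.62):
    ratio = relative_cost_per_resolve(ASSUMED_BASELINE, 1.25, saving)
    print(f"token saving {saving:.0%}: {ratio:.2f}x tokens per resolved task")
```

Under that assumed baseline, "Prohibited" mode spends roughly 0.39 to 0.45 times the tokens per resolved task, because a 1.25pp drop in resolve rate barely offsets a token saving of more than half. A lower baseline rate makes the gap weigh more heavily, so the ratio should be recomputed with real resolve rates.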

Key takeaway

ML engineers designing or deploying LLM-based program repair agents should recognize that unrestricted code execution often provides minimal marginal benefit while incurring substantial token and wall-clock costs. Prioritize adaptive execution strategies that run tests only when a run is expected to yield high-value information, rather than treating execution as a default capability. This approach can significantly reduce operational expenses and environment setup overhead without compromising repair effectiveness.

Key insights

Code execution in LLM-based program repair often provides minimal benefit despite significant costs, suggesting that agents apply it indiscriminately.

Principles

Code execution is a resource with an explicit cost-benefit trade-off.
Late-stage executions yield higher success rates.
Partial execution access can be counterproductive.

Method

A two-stage empirical study analyzed 7,745 agent traces and conducted 3,000 controlled repair attempts across 200 SWE-bench instances under varying execution access paradigms.

In practice

Restrict execution for cost-sensitive LLM agents.
Adopt execution selectively, based on clear value.
Focus on enhancing execution feedback quality.
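The practices above could be combined into a simple gate that decides whether a test run is worth its cost. The following is a minimal sketch under stated assumptions, not the study's method: `RepairState`, `should_execute`, and the confidence thresholds are all hypothetical, and the confidence estimate is assumed to come from the agent itself.

```python
from dataclasses import dataclass


@dataclass
class RepairState:
    """Hypothetical snapshot of an agent's progress on one repair task."""
    has_candidate_patch: bool     # a complete patch exists to validate
    est_fix_confidence: float     # agent's own estimate in [0, 1] that the patch is correct
    est_run_cost_tokens: int      # expected tokens consumed by running tests and reading output
    remaining_token_budget: int   # tokens still available for this task


def should_execute(state: RepairState, low: float = 0.2, high: float = 0.9) -> bool:
    """Return True only when a test run is affordable and likely to be informative."""
    # Skip exploratory runs: without a candidate patch there is nothing to validate.
    if not state.has_candidate_patch:
        return False
    # Skip runs the remaining budget cannot cover.
    if state.est_run_cost_tokens > state.remaining_token_budget:
        return False
    # A run is most informative when the outcome is uncertain; skip it when the
    # agent is already nearly sure the patch is right or wrong.
    return low < state.est_fix_confidence < high


state = RepairState(has_candidate_patch=True, est_fix_confidence=0.6,
                    est_run_cost_tokens=4_000, remaining_token_budget=20_000)
print(should_execute(state))
```

The thresholds are placeholders to tune per agent and benchmark. The point of the design is that execution becomes an explicit decision with a cost attached, rather than a default action.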

Topics

LLM Agents
Program Repair
Code Execution
Cost-Effectiveness
SWE-bench
Resource Management

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.