Many SWE-bench-Passing PRs Would Not Be Merged into Main
Summary
A study evaluating AI-generated pull requests (PRs) on the SWE-bench Verified benchmark found that approximately half of the PRs that pass the automated tests would not be merged by human maintainers. Researchers recruited four active maintainers from three SWE-bench Verified repositories (scikit-learn, Sphinx, pytest) to review 296 AI-generated PRs from models including Claude 3.5 Sonnet, Claude 4 Opus, and GPT-5, alongside 47 human-written "golden patches." The findings indicate that automated grader pass rates are, on average, 24.2 percentage points higher than maintainer merge rates. Maintainer merge rates also appear to improve 9.6 percentage points per year more slowly than automated grader scores, though the evidence for this trend is less robust. Reasons for rejection included poor code quality, failure to deliver the core functionality, and breaking other code. The study also found that automated grading overstates the "time horizon" for task completion by roughly 7 times.
Key takeaway
AI scientists and machine learning engineers evaluating code generation models should treat automated benchmark scores such as SWE-bench critically. Reported success rates likely overestimate real-world mergeability by a significant margin, and time-horizon estimates may be inflated by roughly 7x. Prioritize developing agents that can iterate on feedback and meet maintainers' code quality standards rather than optimizing solely for passing automated tests, so that agent output aligns with what maintainers actually expect.
Key insights
Automated code benchmarks like SWE-bench significantly overstate how useful AI agents are when their output is not subject to human review and iteration.
Principles
- Benchmark scores often misrepresent real-world utility.
- Human maintainer review adds crucial quality gates.
- AI agent iteration is vital for mergeable solutions.
Method
Maintainers reviewed AI-generated PRs that had passed the automated tests. Results were normalized against human-written "golden patches" to quantify real-world merge rates and rejection reasons.
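The normalization step can be illustrated with a minimal Python sketch. It assumes that normalization means dividing the AI merge rate by the golden-patch merge rate; the summary does not spell out the study's exact procedure, so this is an illustration rather than the authors' method. The names ReviewTally, normalized_merge_rate, and gap_pp are hypothetical, and the counts in the usage lines are invented for illustration and are not study data.

```python
from dataclasses import dataclass


@dataclass
class ReviewTally:
    """Maintainer decisions for one group of PRs (hypothetical structure)."""
    merged: int    # PRs the maintainers said they would merge
    reviewed: int  # PRs the maintainers reviewed in total

    @property
    def merge_rate(self) -> float:
        if self.reviewed <= 0:
            raise ValueError("reviewed must be positive")
        return self.merged / self.reviewed


def normalized_merge_rate(ai: ReviewTally, golden: ReviewTally) -> float:
    """AI merge rate as a fraction of the golden-patch merge rate.

    Assumption: ratio normalization. Human-written golden patches are not
    always accepted either, so they serve as the ceiling for comparison.
    """
    if golden.merged == 0:
        raise ValueError("golden-patch merge rate is zero; cannot normalize")
    return ai.merge_rate / golden.merge_rate


def gap_pp(grader_rate: float, maintainer_rate: float) -> float:
    """Difference between two rates (each in [0, 1]) in percentage points."""
    return (grader_rate - maintainer_rate) * 100


# Illustrative counts only, NOT figures from the study.
ai_prs = ReviewTally(merged=25, reviewed=100)
golden_prs = ReviewTally(merged=40, reviewed=80)
print(normalized_merge_rate(ai_prs, golden_prs))
print(gap_pp(0.75, 0.5))
```

Normalizing this way separates "the maintainer would reject this change from anyone" from "the maintainer rejects it because the AI-generated patch falls short", which is why the golden patches are reviewed alongside the AI-generated PRs.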
In practice
- Integrate human review into AI code generation pipelines.
- Develop benchmarks with iterative feedback mechanisms.
- Prioritize AI agent improvements in code quality.
Topics
- SWE-bench Verified
- AI Code Generation
- Pull Request Review
- Benchmark Evaluation
- Maintainer Feedback
- Code Quality
Best for: Research Scientist, AI Engineer, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by METR.