Many SWE-bench-Passing PRs Would Not Be Merged into Main

2026-03-10 · Source: METR · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

A study of AI-generated pull requests (PRs) on the SWE-bench Verified benchmark found that approximately half of the PRs that pass the automated tests would not be merged by human maintainers. Researchers recruited four active maintainers from three SWE-bench Verified repositories (scikit-learn, Sphinx, pytest) to review 296 AI-generated PRs from models like Claude 3.5 Sonnet, Claude 4 Opus, and GPT-5, alongside 47 human-written "golden patches." On average, automated grader scores ran 24.2 percentage points above maintainer merge rates. Maintainer merge rates also appear to improve 9.6 percentage points per year more slowly than grader scores, though the evidence for this trend is less robust. Maintainers rejected PRs for poor code quality, failure to deliver the core functionality, and breaking other code. The study also found that automated graders overstate the "time horizon" for task completion by roughly 7 times.

Key takeaway

If you are an AI scientist or machine learning engineer evaluating code generation models, treat automated benchmark scores such as SWE-bench with caution. Reported success rates likely overestimate real-world mergeability by a significant margin, and time horizons derived from automated graders may be overstated by roughly 7x. Prioritize agents that can iterate on feedback and meet human code quality standards rather than optimizing solely for passing automated tests. Doing so brings your agent's output closer to what maintainers actually expect.

Key insights

Automated benchmark scores like SWE-bench significantly overstate how useful AI agents are when their PRs get no human review and no chance to iterate on feedback.

Principles

Benchmark scores often misrepresent real-world utility.
Human maintainer review adds crucial quality gates.
AI agent iteration is vital for mergeable solutions.

Method

Maintainers reviewed AI-generated PRs passing automated tests, with results normalized against human-written "golden patches" to quantify real-world merge rates and rejection reasons.
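
One way to read the normalization step, as a minimal sketch: assume each review is recorded as a boolean merge decision, and assume the AI merge rate is divided by the merge rate the same maintainers gave the golden patches. The function names, the ratio form of the normalization, and the sample decisions below are hypothetical illustrations, not code or figures from the study.

```python
from typing import Sequence


def merge_rate(decisions: Sequence[bool]) -> float:
    """Fraction of reviewed PRs that the maintainer would merge."""
    if not decisions:
        raise ValueError("no decisions to aggregate")
    return sum(decisions) / len(decisions)


def normalized_merge_rate(ai: Sequence[bool], golden: Sequence[bool]) -> float:
    """AI merge rate relative to the golden-patch merge rate.

    Assumption: normalization is a simple ratio against the human baseline.
    """
    baseline = merge_rate(golden)
    if baseline == 0:
        raise ValueError("golden-patch baseline merge rate is zero")
    return merge_rate(ai) / baseline


def gap_pp(grader_pass_rate: float, maintainer_rate: float) -> float:
    """Grader score minus maintainer merge rate, in percentage points."""
    return (grader_pass_rate - maintainer_rate) * 100


# Hypothetical review outcomes, for illustration only.
ai_decisions = [True, False, True, False]
golden_decisions = [True, True, True, False]

if __name__ == "__main__":
    print(normalized_merge_rate(ai_decisions, golden_decisions))
    print(gap_pp(1.0, merge_rate(ai_decisions)))
```

Dividing by the golden-patch baseline separates how strict a reviewer is in general from how well the AI-generated PRs fare, since even human-written patches are not always judged mergeable.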

In practice

Integrate human review into AI code generation pipelines.
Develop benchmarks with iterative feedback mechanisms.
Prioritize AI agent improvements in code quality.
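
The first recommendation can be sketched as a review gate that counts a PR as mergeable only when it passes the tests and a reviewer raises no objection. This is a hypothetical sketch: the `Review` record and `summarize` helper are invented for illustration, the rejection categories mirror the reasons named in the summary, and the sample records are made up.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Tuple


class Rejection(Enum):
    CODE_QUALITY = "code quality"
    CORE_FUNCTIONALITY = "core functionality failure"
    BREAKS_OTHER_CODE = "breaks other code"


@dataclass(frozen=True)
class Review:
    tests_passed: bool
    rejection: Optional[Rejection] = None  # None means the reviewer would merge

    @property
    def mergeable(self) -> bool:
        # Passing the automated tests is necessary but not sufficient.
        return self.tests_passed and self.rejection is None


def summarize(reviews: Iterable[Review]) -> Tuple[float, float, Counter]:
    """Return (test pass rate, merge rate, rejection counts among test-passing PRs)."""
    items = list(reviews)
    if not items:
        raise ValueError("no reviews to summarize")
    pass_rate = sum(r.tests_passed for r in items) / len(items)
    merged_rate = sum(r.mergeable for r in items) / len(items)
    reasons = Counter(
        r.rejection for r in items if r.tests_passed and r.rejection is not None
    )
    return pass_rate, merged_rate, reasons


# Hypothetical review records, for illustration only.
sample = [
    Review(tests_passed=True),
    Review(tests_passed=True, rejection=Rejection.CODE_QUALITY),
    Review(tests_passed=True, rejection=Rejection.BREAKS_OTHER_CODE),
    Review(tests_passed=False),
]

if __name__ == "__main__":
    print(summarize(sample))
```

Tracking the pass rate and the merge rate side by side makes the gap between them visible, and the rejection counts show where agent improvements would matter most.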

Topics

SWE-bench Verified
AI Code Generation
Pull Request Review
Benchmark Evaluation
Maintainer Feedback
Code Quality

Best for: Research Scientist, AI Engineer, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by METR.