Many SWE-bench-Passing PRs Would Not Be Merged into Main

Source: METR · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

A study evaluating AI-generated pull requests (PRs) on the SWE-bench Verified benchmark found that roughly half of the PRs that pass the automated tests would not be merged by human maintainers. Researchers recruited four active maintainers from three SWE-bench Verified repositories (scikit-learn, Sphinx, pytest) to review 296 AI-generated PRs from models such as Claude 3.5 Sonnet, Claude 4 Opus, and GPT-5, alongside 47 human-written "golden patches." Automated grader pass rates are, on average, 24.2 percentage points higher than maintainer merge rates. Maintainer merge rates also appear to improve 9.6 percentage points per year more slowly than automated grader scores, although the evidence for this trend is less robust. Reasons for rejection included poor code quality, failure to deliver the core functionality, and breaking other code. The study also found that automated graders overstate the "time horizon" for task completion by a factor of roughly 7.

Key takeaway

If you are an AI scientist or machine learning engineer evaluating code generation models, treat automated benchmark scores such as SWE-bench results critically. Your models' reported success rates likely overestimate real-world mergeability by a significant margin, and their time horizons may be overstated by roughly 7x. Prioritize agents that can iterate on feedback and meet human code quality standards rather than optimizing solely for passing automated tests, so that their output matches what maintainers actually expect.

Key insights

Automated code benchmarks like SWE-bench significantly overstate how useful AI agents are when their output is not reviewed and iterated on by humans.

Method

Maintainers reviewed AI-generated PRs passing automated tests, with results normalized against human-written "golden patches" to quantify real-world merge rates and rejection reasons.
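
The summary does not spell out how the normalization is computed. As a minimal sketch, assuming it is a simple ratio of the AI merge rate to the golden-patch merge rate, the idea could look like the following. The function names and the review outcomes are hypothetical and are not data from the study.

```python
def merge_rate(decisions):
    """Fraction of reviewed PRs that the maintainer would merge."""
    if not decisions:
        raise ValueError("need at least one review decision")
    return sum(decisions) / len(decisions)


def normalized_merge_rate(ai_decisions, golden_decisions):
    """AI merge rate divided by the golden-patch merge rate.

    Assumption: normalization is a plain ratio. The source only says
    results were "normalized against" golden patches.
    """
    baseline = merge_rate(golden_decisions)
    if baseline == 0:
        raise ValueError("golden-patch merge rate is zero; cannot normalize")
    return merge_rate(ai_decisions) / baseline


# Made-up review outcomes (True = "would merge"), NOT figures from the study.
ai_decisions = [True, False, True, False]
golden_decisions = [True, True, True, False]

raw = merge_rate(ai_decisions)
normalized = normalized_merge_rate(ai_decisions, golden_decisions)
print(f"raw merge rate: {raw:.2f}, normalized: {normalized:.2f}")
```

Dividing by the golden-patch baseline separates shortcomings of the AI-generated PRs from noise in the review itself, since maintainers may also decline some human-written patches.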

Best for: Research Scientist, AI Engineer, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by METR.