Why Open-Source & Chinese LLMs Lead Coding Benchmarks, But Struggle in the Real World?
Summary
The article introduces SWE-rebench, a benchmark that evaluates large language models (LLMs) on fresh, previously unseen GitHub issues, and contrasts it with the long-standing SWE-bench. Models such as MiniMax M2.5 and Claude Opus 4.6 scored around 80% on SWE-bench, but their performance dropped sharply on SWE-rebench: Claude Code and Claude Opus 4.6 landed in the low 50% range, and MiniMax M2.5 reached 39.6%. This decline, and the resulting shift in model rankings, suggests that prolonged public exposure to a benchmark can lead to "benchmark saturation," where models become optimized for the benchmark's specific task distribution rather than for true generalization. SWE-rebench pairs newly collected GitHub issues with a standardized evaluation framework to give a more accurate measure of a model's real-world problem-solving capability.
Key takeaway
AI architects and CTOs evaluating LLMs for coding tasks should prioritize models that perform well on fresh, dynamic benchmarks like SWE-rebench. Relying solely on scores from long-public benchmarks like SWE-bench risks selecting models optimized for specific, known task distributions rather than for true generalization. Based on their higher resolved rates on unseen problems, Claude Code and gpt-5.2 (medium) are the top candidates for teams that need dependable real-world engineering performance.
Key insights
Benchmark saturation inflates LLM coding scores, so measuring true generalization requires fresh evaluation methods like SWE-rebench.
Principles
- Benchmarks decay over time.
- Generalization is key for real-world utility.
Method
SWE-rebench evaluates LLMs on newly collected GitHub issues, which keeps the tasks fresh. Every model runs under the same standardized framework: minimal ReAct-style scaffolding, identical prompts, and five independent runs per model to reduce evaluation noise.
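To make the role of the five independent runs concrete, here is a minimal Python sketch of how per-run results could be combined into a single resolved rate with an uncertainty estimate. The summary does not say how SWE-rebench aggregates its runs, so averaging the per-run rates and reporting a standard error is an assumption, and the function names and the sample outcomes below are hypothetical, not benchmark data.

```python
from math import sqrt
from statistics import mean, stdev


def resolved_rate(outcomes):
    """Fraction of tasks resolved in a single run (outcomes: list of bools)."""
    return sum(outcomes) / len(outcomes)


def summarize_runs(runs):
    """Return (mean resolved rate, standard error of the mean) across runs.

    Assumption: runs are aggregated by averaging per-run rates; the
    benchmark's actual aggregation is not described in the summary.
    """
    rates = [resolved_rate(run) for run in runs]
    avg = mean(rates)
    # Sample standard deviation needs at least two runs.
    sem = stdev(rates) / sqrt(len(rates)) if len(rates) > 1 else 0.0
    return avg, sem


# Hypothetical outcomes: five runs over the same four tasks.
runs = [
    [True, False, True, False],
    [True, True, True, False],
    [True, False, False, False],
    [True, False, True, False],
    [True, False, True, True],
]

avg, sem = summarize_runs(runs)
print(f"resolved rate: {avg:.1%} (standard error {sem:.1%})")
```

Under this assumption, a gap between two models that is smaller than their combined standard errors is better read as noise than as a real ranking difference.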
In practice
- Prioritize models excelling on fresh benchmarks.
- Consider Claude Code or gpt-5.2 (medium) for stronger results on unseen problems.
- Be wary of long-public benchmark scores.
Topics
- LLM Evaluation
- Coding Benchmarks
- SWE-rebench
- Benchmark Saturation
- Model Generalization
Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by To Data & Beyond.