Why Open-Source & Chinese LLMs Lead Coding Benchmarks, But Struggle in the Real World?

· Source: To Data & Beyond · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

The article introduces SWE-rebench, a benchmark that evaluates large language models (LLMs) on fresh, unseen GitHub issues, and contrasts it with the long-standing SWE-bench. Models such as MiniMax M2.5 and Claude Opus 4.6 score around 80% on SWE-bench, but their resolved rates drop sharply on SWE-rebench: Claude Code and Claude Opus 4.6 land in the low 50% range, and MiniMax M2.5 falls to 39.6%. This decline, and the resulting shift in model rankings, suggests that prolonged public exposure to a benchmark can lead to "benchmark saturation," where models are optimized for the benchmark's specific distribution rather than for true generalization. SWE-rebench uses newly collected GitHub issues and a standardized evaluation framework to give a more accurate measure of a model's real-world problem-solving capability.

Key takeaway

AI Architects and CTOs evaluating LLMs for coding tasks should prioritize models that perform strongly on fresh, dynamic benchmarks like SWE-rebench. Relying solely on scores from long-public benchmarks like SWE-bench risks selecting models optimized for specific, known task distributions rather than for true generalization. Teams should consider Claude Code or gpt-5.2 (medium) as top candidates, given their higher resolved rates on unseen problems, which are a stronger signal of real-world engineering performance.

Key insights

Benchmark saturation inflates LLM coding scores, so fresh evaluation sets like SWE-rebench are needed to measure true generalization.

Principles

Method

SWE-rebench evaluates LLMs on newly collected GitHub issues, which keeps the tasks fresh. It uses a standardized framework with minimal ReAct-style scaffolding, identical prompts, and five independent runs per model to reduce evaluation noise.
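
To illustrate why repeated runs reduce noise, here is a minimal Python sketch of how a resolved rate could be averaged over five independent runs. The aggregation (mean plus standard error), the function names, and the sample outcomes are all assumptions made for illustration; the summary does not say how SWE-rebench combines its runs.

```python
from math import sqrt
from statistics import mean, stdev


def resolved_rate(run):
    """Fraction of tasks resolved in one run (run is a list of booleans)."""
    return sum(run) / len(run)


def summarize(runs):
    """Mean resolved rate and standard error of the mean across runs."""
    rates = [resolved_rate(r) for r in runs]
    # The standard error needs at least two runs; report 0.0 otherwise.
    sem = stdev(rates) / sqrt(len(rates)) if len(rates) > 1 else 0.0
    return mean(rates), sem


# Hypothetical outcomes: 5 independent runs over the same 4 tasks.
# True means the issue was resolved in that run. Not real benchmark data.
runs = [
    [True, False, True, False],
    [True, True, True, False],
    [True, False, False, False],
    [True, False, True, False],
    [True, True, False, False],
]

avg, sem = summarize(runs)
print(f"resolved rate: {avg:.1%} +/- {sem:.1%}")
```

Reporting the spread alongside the mean shows whether a gap between two models is larger than the run-to-run variation of a single model.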

Topics

Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by To Data & Beyond.