SWE-bench February 2026 leaderboard update
Summary
The SWE-bench leaderboard was updated in February 2026 with independently measured results for current-generation large language models on coding tasks. The "Bash Only" benchmark runs mini-swe-agent, a 9,000-line Python tool, against SWE-bench Verified, a manually curated subset of 500 real-world coding problems drawn from 12 open-source repositories, including Django and SymPy. Claude 4.5 Opus leads with a 76.8% resolution rate, followed by Gemini 3 Flash and MiniMax M2.5, both at 75.8%. Four models from Chinese labs (MiniMax M2.5, GLM-5, Kimi K2.5, and DeepSeek V3.2) placed in the top ten. OpenAI's GPT-5.2 ranked sixth at 72.8%; its specialized coding model, GPT-5.3-Codex, was not included. Every model is run with the same system prompt, so the scores reflect model capability rather than per-model prompt engineering.
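To make the percentages concrete, the short sketch below converts the quoted resolution rates into whole problem counts. It assumes each score is simply resolved problems divided by the 500 problems in the dataset, so one problem is worth 0.2 percentage points; the function and variable names are illustrative only, and only models whose scores are stated above are included.

```python
# Scores as quoted in the summary; the conversion assumes
# rate = resolved / 500, which is an assumption of this sketch.
DATASET_SIZE = 500  # problems in SWE-bench Verified

scores_percent = {
    "Claude 4.5 Opus": 76.8,
    "Gemini 3 Flash": 75.8,
    "MiniMax M2.5": 75.8,
    "GPT-5.2": 72.8,
}


def resolved_count(percent: float, total: int = DATASET_SIZE) -> int:
    """Convert a resolution rate in percent to a whole number of problems."""
    # round() guards against floating-point noise such as 383.9999.
    return round(percent * total / 100)


counts = {model: resolved_count(p) for model, p in scores_percent.items()}
for model, n in counts.items():
    print(f"{model}: {n}/{DATASET_SIZE}")

# Gap between first and second place, expressed in problems.
lead = counts["Claude 4.5 Opus"] - counts["Gemini 3 Flash"]
print(f"Lead of first over second place: {lead} problems")
```

Under that assumption, a one-point gap on this leaderboard corresponds to five problems out of 500, which is worth keeping in mind when comparing closely ranked models.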
Key takeaway
For AI engineers evaluating large language models for code generation and bug fixing, the updated SWE-bench Verified leaderboard offers independently validated performance data. Claude 4.5 Opus (76.8%) and Gemini 3 Flash (75.8%) post the highest resolution rates and are reasonable first candidates. The benchmark does not measure prompt engineering, so testing with prompts optimized for your own use case may produce different results.
Key insights
Independent SWE-bench Verified results show Claude 4.5 Opus leading in coding problem resolution.
Principles
- Independent benchmarks reduce vendor bias in model comparisons.
- Using the same prompt for every model keeps the comparison about the model, not the prompt.
Method
The benchmark runs mini-swe-agent against the 500-sample SWE-bench Verified dataset, applying the same system prompt to every model, and reports the percentage of problems each model resolves.
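The method can be pictured as a loop that gives every model the same prompt and the same tasks, then reports the share of tasks resolved. The sketch below is a toy illustration under stated assumptions, not the benchmark's actual harness: `Task`, `resolution_rate`, `compare`, the placeholder prompt text, and the stub `fake_attempt` are all hypothetical names introduced here, and the "models" are fixed sets of solved task IDs rather than real API calls.

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder text; the real benchmark prompt is not reproduced here.
SYSTEM_PROMPT = "Fix the reported issue using only shell commands."


@dataclass(frozen=True)
class Task:
    instance_id: str
    issue_text: str


# Hypothetical signature: (model, system_prompt, task) -> resolved?
Attempt = Callable[[str, str, Task], bool]


def resolution_rate(model: str, tasks: list[Task], attempt: Attempt) -> float:
    """Percentage of tasks resolved, using one shared prompt for all models."""
    resolved = sum(attempt(model, SYSTEM_PROMPT, task) for task in tasks)
    return 100 * resolved / len(tasks)


def compare(models: list[str], tasks: list[Task], attempt: Attempt):
    """Rank models by resolution rate, highest first."""
    rates = {m: resolution_rate(m, tasks, attempt) for m in models}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Toy data: ten placeholder tasks and two made-up models.
tasks = [Task(f"demo-{i}", "placeholder issue") for i in range(10)]
solved = {"model-a": {0, 1, 2, 3, 4, 5, 6}, "model-b": {0, 1, 2, 3}}


def fake_attempt(model: str, system_prompt: str, task: Task) -> bool:
    # Stub standing in for "run the agent, then run the project's tests".
    return int(task.instance_id.split("-")[1]) in solved[model]


ranking = compare(["model-b", "model-a"], tasks, fake_attempt)
print(ranking)
```

Because the prompt and task list are fixed inside the harness, the only thing that varies between runs is the model, which is the property the leaderboard relies on for comparability.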
In practice
- Review SWE-bench Verified for model coding capabilities.
- Consider Claude 4.5 Opus for code generation tasks.
Topics
- SWE-bench
- Code Generation Benchmarks
- Large Language Models
- LLM Performance
- Browser Automation
Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.