SWE-bench February 2026 leaderboard update

2026-02-19 · Source: Simon Willison's Weblog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, short

Summary

The SWE-bench leaderboard was updated in February 2026 with independently verified results for current-generation large language models on coding tasks. The "Bash Only" benchmark runs mini-swe-agent, a 9,000-line Python tool, against the SWE-bench Verified dataset, a manually curated subset of 500 real-world coding problems from 12 open-source repositories, including Django and SymPy. Claude 4.5 Opus leads with a 76.8% resolution rate, followed by Gemini 3 Flash and MiniMax M2.5, both at 75.8%. Four models from Chinese labs (MiniMax M2.5, GLM-5, Kimi K2.5, and DeepSeek V3.2) placed in the top ten. OpenAI's GPT-5.2 ranked sixth at 72.8%, though its specialized coding model, GPT-5.3-Codex, was not included. The benchmark uses the same system prompt for every model, so it compares the models themselves and does not measure the effect of prompt engineering.

Key takeaway

For AI engineers evaluating large language models for code generation and bug fixing, the updated SWE-bench Verified leaderboard offers independently validated performance data. Claude 4.5 Opus (76.8%) and Gemini 3 Flash (75.8%) have the highest resolution rates and are reasonable first candidates. The benchmark does not measure prompt engineering, so testing with prompts optimized for your own use case may give different results.

Key insights

Independent SWE-bench Verified results show Claude 4.5 Opus leading in coding problem resolution.

Principles

Independently run benchmarks reduce bias in model comparisons.
Using the same prompt for every model keeps the comparison fair.

Method

The benchmark runs mini-swe-agent against the 500-problem SWE-bench Verified dataset with the same system prompt for every model, and reports the share of problems each model resolves.
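
To make the percentages concrete, the sketch below converts the resolution rates quoted in the summary into approximate problem counts. It assumes every model was scored on all 500 problems, which is an assumption here. The names `DATASET_SIZE`, `resolved_count`, and `resolution_rate` are illustrative and are not part of any SWE-bench tooling.

```python
# Minimal sketch: translate reported resolution rates into problem counts.
# Assumption: each model was scored on all 500 SWE-bench Verified problems.
DATASET_SIZE = 500

# Resolution rates (percent) as quoted in the summary above.
reported = {
    "Claude 4.5 Opus": 76.8,
    "Gemini 3 Flash": 75.8,
    "MiniMax M2.5": 75.8,
    "GPT-5.2": 72.8,
}


def resolved_count(percent: float, total: int = DATASET_SIZE) -> int:
    """Approximate number of problems resolved for a given percentage."""
    return round(percent / 100 * total)


def resolution_rate(resolved: int, total: int = DATASET_SIZE) -> float:
    """Resolution rate in percent for a given number of resolved problems."""
    return 100 * resolved / total


counts = {model: resolved_count(pct) for model, pct in reported.items()}

# Gap between the leader and second place, in problems rather than points.
gap = counts["Claude 4.5 Opus"] - counts["Gemini 3 Flash"]

for model, n in counts.items():
    print(f"{model}: about {n} of {DATASET_SIZE} problems resolved")
print(f"Leader's margin over second place: about {gap} problems")
```

On a 500-problem set, one percentage point corresponds to five problems, so the gap between the top models is small in absolute terms.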

In practice

Review SWE-bench Verified for model coding capabilities.
Consider Claude 4.5 Opus for code generation tasks.

Topics

SWE-bench
Code Generation Benchmarks
Large Language Models
LLM Performance
Browser Automation

Best for: AI Engineer, AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Simon Willison's Weblog.