SWE-Bench is getting replaced???

· Source: Theo - t3․gg · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

The article examines the shortcomings of existing AI coding benchmarks such as SWE-Bench Pro, citing contamination, unrealistic problems, and flawed verification. It introduces DBSE (Deep SWE Bench), a new benchmark developed by Data Curve and designed to evaluate AI coding agents more realistically. DBSE features novel tasks, several languages (TypeScript, Go, Python), shorter prompts, and handwritten behavioral verifiers. In the initial DBSE results, OpenAI's GPT-55 leads with a 70% success rate, followed by GPT-54 at 56% and Claude Opus at 54%. This contrasts sharply with SWE-Bench Pro, where the gaps between models were often smaller and some models' scores were inflated. DBSE also highlights cost inefficiencies: Opus is over 3x more expensive than GPT-55 while scoring lower. The author, an investor in Data Curve, argues that DBSE's value lies in confirming what developers already experience in real-world use.
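To see how the success rates and the cost gap combine, here is a rough sketch of a cost-per-solved-task comparison. The success rates come from the summary above. The cost figures are normalized placeholders, and treating "over 3x" as exactly 3x per attempted task is an assumption, so the result is best read as a lower bound rather than a reported number.

```python
def cost_per_solved_task(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per solved task, assuming every attempt costs the same."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Assumption: costs are normalized so that GPT-55 costs 1 unit per attempt.
GPT_COST = 1.0
# Assumption: "over 3x more expensive" is treated as exactly 3x (a lower bound).
OPUS_COST = 3.0

gpt = cost_per_solved_task(GPT_COST, 0.70)    # 70% success, from the summary
opus = cost_per_solved_task(OPUS_COST, 0.54)  # 54% success, from the summary

ratio = opus / gpt
print(f"Opus costs about {ratio:.2f}x as much per solved task")
```

Under these assumptions the gap per solved task is wider than the gap per attempt, because the more expensive model also fails more often.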

Key takeaway

AI engineers evaluating coding agents should recognize that traditional benchmarks like SWE-Bench Pro are compromised by contamination and unrealistic tasks. Prioritize models validated by behavior-focused benchmarks like DBSE, which reveal significant performance and cost disparities. Write prompts that describe the problem and the desired outcome and leave the implementation to the agent, and consider building your own mini-benchmarks from real-world failures to guide model selection.
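A mini-benchmark of this kind can be very small. The sketch below is one possible shape, not a tool described in the article: `Case`, `pass_rate`, the two example cases, and both stand-in agents are hypothetical names invented for illustration. Each case pairs an outcome-focused prompt with a check on the result, and an agent is modeled as a plain function from prompt to output text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str                   # describes the problem and the desired outcome
    check: Callable[[str], bool]  # judges the result, not how it was produced


def pass_rate(agent: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose check accepts the agent's output."""
    if not cases:
        raise ValueError("need at least one case")
    passed = 0
    for case in cases:
        try:
            ok = case.check(agent(case.prompt))
        except Exception:
            ok = False  # a crash in the agent or the check counts as a failure
        if ok:
            passed += 1
    return passed / len(cases)


# Hypothetical cases; in practice each would come from a failure you actually hit.
CASES = [
    Case("iso-date", "Format 1 March 2024 as an ISO 8601 date.",
         lambda out: out.strip() == "2024-03-01"),
    Case("sum", "Return the sum of 19 and 23 as a bare number.",
         lambda out: int(out) == 42),
]


# Stand-in agents with canned answers; replace these with real model calls.
def careful_agent(prompt: str) -> str:
    return {CASES[0].prompt: "2024-03-01\n", CASES[1].prompt: "42"}.get(prompt, "")


def sloppy_agent(prompt: str) -> str:
    return {CASES[0].prompt: "2024-03-01", CASES[1].prompt: "forty-two"}.get(prompt, "")


for agent in (careful_agent, sloppy_agent):
    print(f"{agent.__name__}: {pass_rate(agent, CASES):.0%}")
```

Keeping the agent behind a single callable makes it easy to swap models and rerun the same cases whenever a new one is released.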

Key insights

Existing AI coding benchmarks are undermined by contamination and unrealistic tasks, while DBSE offers a more accurate, behavior-focused evaluation.

Method

DBSE uses handwritten verifiers that check software behavior rather than implementation details. Its tasks are written from scratch across several languages (TypeScript, Go, Python), and its prompts are shorter and focused on behavior.
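The difference between checking behavior and checking implementation can be shown with a small sketch. This is not DBSE's actual verifier code, which the summary does not show. The task (a `slugify` function), the function `verify_slugify`, and both candidate solutions are hypothetical. The point is that the verifier only looks at inputs and outputs, so two differently written solutions both pass.

```python
import re
from typing import Callable


def verify_slugify(slugify: Callable[[str], str]) -> list[str]:
    """Return a list of failed behavior checks; an empty list means a pass."""
    cases = {
        "Hello World": "hello-world",
        "  spaces  everywhere ": "spaces-everywhere",
        "Already-slugged": "already-slugged",
    }
    failures = []
    for raw, expected in cases.items():
        got = slugify(raw)
        if got != expected:
            failures.append(f"slugify({raw!r}) returned {got!r}, expected {expected!r}")
    return failures


# Two hypothetical solutions with different internals.
def slugify_regex(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def slugify_split(text: str) -> str:
    return "-".join(text.lower().split())


for candidate in (slugify_regex, slugify_split, str.lower):
    failures = verify_slugify(candidate)
    status = "pass" if not failures else f"fail ({len(failures)} checks)"
    print(f"{candidate.__name__}: {status}")
```

A verifier that instead searched the source for a particular helper name or regex would reject one of the two working solutions, which is the kind of implementation coupling a behavior-focused check avoids.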

Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Theo - t3․gg.