Why it’s getting harder to measure AI performance
Summary
Measuring AI performance is getting harder as frontier models outgrow the benchmarks built to track them. The METR chart rates models by the time a human programmer needs to complete the tasks they can handle: GPT-3.5 (30 seconds), GPT-4 (4 minutes), o1 (40 minutes), GPT-5 (3 hours), and Claude Opus 4.6 (12 hours). The progress is impressive, but as models saturate the chart's hardest tasks the estimates become noisy, with wide confidence intervals (e.g., 5 to 66 hours for Claude Opus 4.6). Traditional benchmarks show the same pattern. On MMLU, GPT-3 scored 43.9% in 2020 and GPT-4.1 reached 90.2% in 2025, and scores have saturated around 93% because of errors in the questions themselves. That has prompted new, harder evaluations such as Humanity's Last Exam (HLE), on which Gemini 3.1 scored 44.7%. This saturation highlights a growing divergence between measurable capabilities and real-world task performance.
Key takeaway
If you are an AI scientist evaluating frontier models or designing new benchmarks, recognize that traditional metrics like MMLU are saturating, and that even advanced benchmarks like METR face significant noise and cost challenges as models approach human-level performance on their hardest tasks. Shift your focus toward novel evaluation paradigms that can reliably assess AI capabilities on complex, multi-week, real-world tasks, rather than relying solely on current, increasingly limited benchmarks. Doing so helps you avoid misinterpreting progress and keeps comparisons meaningful.
Key insights
AI performance measurement faces increasing challenges from benchmark saturation and the difficulty of evaluating complex, real-world tasks.
Principles
- AI benchmarks have a natural lifecycle, saturating as models improve.
- Measuring AI on complex, open-ended tasks is inherently difficult.
- Logarithmic scales can mask significant measurement noise in AI progress.
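To make the last principle concrete, here is a minimal sketch using only the METR figures quoted in the summary. It expresses both the overall progress and the width of one confidence interval in doublings, which is how distances appear on a logarithmic axis. The variable names and the `doublings` helper are illustrative assumptions, not part of METR's methodology.

```python
import math

# Figures quoted in the summary (METR time horizons, converted to hours).
gpt35_hours = 30 / 3600            # GPT-3.5: 30 seconds
opus_hours = 12.0                  # Claude Opus 4.6: 12 hours (point estimate)
ci_low_hours, ci_high_hours = 5.0, 66.0  # Claude Opus 4.6 confidence interval


def doublings(start: float, end: float) -> float:
    """Number of doublings needed to go from start to end (hypothetical helper)."""
    return math.log2(end / start)


# How far apart the interval bounds are, as a plain ratio and in doublings.
ci_ratio = ci_high_hours / ci_low_hours
ci_doublings = doublings(ci_low_hours, ci_high_hours)

# Total progress from GPT-3.5 to the Claude Opus 4.6 point estimate.
total_doublings = doublings(gpt35_hours, opus_hours)

# Share of the whole log-scale span covered by this single interval.
ci_share = ci_doublings / total_doublings

print(f"CI bounds differ by a factor of {ci_ratio:.1f}")
print(f"CI width: {ci_doublings:.2f} doublings")
print(f"Total progress: {total_doublings:.2f} doublings")
print(f"CI covers {ci_share:.0%} of the total span")
```

On a log-scale chart the interval looks like a modest error bar, yet under these figures its bounds differ by more than a factor of ten and it spans roughly a third of the full distance between GPT-3.5 and Claude Opus 4.6.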
Method
METR measures AI model capabilities by quantifying the human programmer time required to complete specific software engineering tasks, ranging from seconds to hours, to establish a comparative difficulty scale.
In practice
- Consider confidence intervals when interpreting benchmark scores.
- Anticipate benchmark saturation for established evaluations.
- Develop new benchmarks for multi-week, open-ended AI tasks.
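One rough way to anticipate saturation is to ask how much of a benchmark's usable range has already been consumed. The sketch below applies that idea to the MMLU figures quoted in the summary. Treating GPT-3's score as the baseline and 93% as the practical ceiling is an assumption made for illustration, and `headroom_used` is a hypothetical helper rather than an established metric.

```python
def headroom_used(score: float, baseline: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling range that a score has covered."""
    if not baseline < ceiling:
        raise ValueError("baseline must be below ceiling")
    return (score - baseline) / (ceiling - baseline)


# MMLU figures quoted in the summary (percent correct).
gpt3_score = 43.9         # GPT-3, 2020 (used here as the assumed baseline)
gpt41_score = 90.2        # GPT-4.1, 2025
practical_ceiling = 93.0  # approximate saturation level cited in the summary

used = headroom_used(gpt41_score, gpt3_score, practical_ceiling)
remaining_points = practical_ceiling - gpt41_score

print(f"Headroom used: {used:.1%}")
print(f"Points left before the ceiling: {remaining_points:.1f}")
```

With fewer than three points left under these assumptions, further gains on MMLU say little about differences between frontier models, which is the situation harder evaluations such as HLE are meant to address.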
Topics
- AI Benchmarking
- Model Evaluation
- LLM Performance
- Benchmark Saturation
- METR
- Humanity's Last Exam
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Understanding AI.