Why it’s getting harder to measure AI performance

2026-04-02 · Source: Understanding AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

Measuring AI performance is getting harder as frontier models outgrow existing benchmarks. The METR chart rates each model by the time a human programmer would need for the tasks the model can complete: GPT-3.5 (30 seconds), GPT-4 (4 minutes), o1 (40 minutes), GPT-5 (3 hours), and Claude Opus 4.6 (12 hours). The progress is impressive, but as models saturate the chart's hardest tasks the estimates become noisy and the confidence intervals wide (5 to 66 hours for Claude Opus 4.6). Traditional benchmarks show the same pattern. On MMLU, GPT-3 scored 43.9% in 2020 and GPT-4.1 reached 90.2% in 2025, and scores have saturated around 93% because some of the questions themselves contain errors. That has prompted new, harder evaluations such as Humanity's Last Exam (HLE), on which Gemini 3.1 scored 44.7%. As benchmarks saturate, the gap widens between what they can measure and how models perform on real-world tasks.

Key takeaway

If you evaluate frontier models or design benchmarks, recognize that traditional metrics like MMLU are saturating, and that even advanced benchmarks like METR face significant noise and cost challenges as models approach human-level performance on their hardest tasks. Shift your focus toward new evaluation paradigms that can reliably assess AI capabilities on complex, multi-week, real-world tasks, instead of relying solely on current, increasingly limited benchmarks. Doing so guards against misreading progress and keeps model comparisons meaningful.

Key insights

AI performance measurement faces increasing challenges from benchmark saturation and the difficulty of evaluating complex, real-world tasks.

Principles

AI benchmarks have a natural lifecycle, saturating as models improve.
Measuring AI on complex, open-ended tasks is inherently difficult.
Logarithmic scales can mask significant measurement noise in AI progress.
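
To make the last principle concrete, the sketch below converts the METR figures quoted in the Summary into doublings, the unit a logarithmic axis effectively plots. It uses only those quoted numbers. The variable and function names are illustrative and do not come from METR.

```python
import math

# Figures quoted in the Summary (human-programmer task time).
GPT35_SECONDS = 30
OPUS_POINT_HOURS = 12
OPUS_CI_HOURS = (5, 66)  # reported confidence interval for Claude Opus 4.6


def doublings(start, end):
    """Number of doublings needed to get from start to end (same units)."""
    return math.log2(end / start)


# Progress from GPT-3.5 to Claude Opus 4.6, measured in doublings.
point_doublings = doublings(GPT35_SECONDS, OPUS_POINT_HOURS * 3600)

# Width of the Opus 4.6 interval, as a ratio and in doublings.
ci_ratio = OPUS_CI_HOURS[1] / OPUS_CI_HOURS[0]
ci_doublings = doublings(*OPUS_CI_HOURS)

print(f"GPT-3.5 to Claude Opus 4.6: {point_doublings:.1f} doublings")
print(f"Opus 4.6 interval: factor of {ci_ratio:.1f}, or {ci_doublings:.1f} doublings")
```

The whole GPT-3.5 to Claude Opus 4.6 span is roughly 10.5 doublings, while the interval for the newest model alone covers about 3.7 doublings (a factor of about 13). On a log axis that interval looks modest next to the overall trend, even though it runs from 5 hours to 66 hours.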

Method

METR gauges AI model capability against a difficulty scale built from human effort: each software engineering task is rated by the time a human programmer needs to complete it, from seconds to hours, and models are compared by how far up that scale they can go.
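
This summary does not spell out how per-task results are turned into a single time figure for a model. The sketch below shows one plausible approach, stated here as an assumption: fit a logistic curve of success probability against the log of human task time, then report the task length at which predicted success is 50%. The task results are invented for illustration, and the function names are hypothetical.

```python
import math


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * x) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fifty_percent_horizon_minutes(results):
    """Task length (minutes of human time) at which fitted success is 50%."""
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [1.0 if succeeded else 0.0 for _, succeeded in results]
    a, b = fit_logistic(xs, ys)
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0.
    return 2 ** (-a / b), b


# Hypothetical results: (human minutes for the task, did the model succeed?)
results = [
    (1, True), (2, True), (4, True), (8, True), (15, True),
    (30, False), (30, True), (60, True), (60, False), (120, False),
    (240, True), (240, False), (480, False), (960, False),
]

horizon, slope = fifty_percent_horizon_minutes(results)
print(f"Estimated 50% horizon: {horizon:.0f} human-minutes")
```

With only a handful of tasks near the transition from success to failure, small changes in the data move the fitted horizon considerably. That is one way the noise described in the Summary can arise once a model passes most of the hardest tasks available.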

In practice

Consider confidence intervals when interpreting benchmark scores.
Anticipate benchmark saturation for established evaluations.
Develop new benchmarks for multi-week, open-ended AI tasks.
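
The first two points can be sketched in a few lines. The example below computes a Wilson score interval for a benchmark accuracy, then expresses a score as a share of an assumed ceiling. The question counts are hypothetical, chosen only to show how the interval narrows with more questions. Treating 93% as the MMLU ceiling follows the saturation level quoted in the Summary and is a simplification.

```python
import math


def wilson_interval(score, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy over n questions."""
    denom = 1 + z ** 2 / n
    center = (score + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(score * (1 - score) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


def share_of_ceiling(score, ceiling):
    """Fraction of the achievable score a model has reached."""
    return score / ceiling


# HLE score quoted in the Summary, with hypothetical question counts.
small = wilson_interval(0.447, n=500)
large = wilson_interval(0.447, n=5000)
print(f"n=500:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n=5000: {large[0]:.3f} to {large[1]:.3f}")

# GPT-4.1 on MMLU, relative to the ~93% saturation level.
mmlu_share = share_of_ceiling(0.902, 0.93)
print(f"GPT-4.1 reaches {mmlu_share:.1%} of the assumed MMLU ceiling")
```

Once a model sits at about 97% of the achievable score, further gains on that benchmark are hard to tell apart from noise, which is the practical signal to move to a harder evaluation.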

Topics

AI Benchmarking
Model Evaluation
LLM Performance
Benchmark Saturation
METR
Humanity's Last Exam

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Understanding AI.