Why it’s getting harder to measure AI performance

· Source: Understanding AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

Measuring AI performance is getting harder because frontier models are advancing faster than the benchmarks built to track them. METR's chart rates each model by the length of software task it can complete, where length is the time the task takes a human programmer: GPT-3.5 (30 seconds), GPT-4 (4 minutes), o1 (40 minutes), GPT-5 (3 hours), and Claude Opus 4.6 (12 hours). The progress is impressive, but as models saturate the chart's hardest tasks the estimates become noisy and the confidence intervals wide (5 to 66 hours for Claude Opus 4.6). Traditional benchmarks show the same pattern. On MMLU, GPT-3 scored 43.9% in 2020 and GPT-4.1 reached 90.2% in 2025, but scores have saturated around 93% because some of the questions themselves contain errors. That has prompted new, harder evaluations such as Humanity's Last Exam (HLE), on which Gemini 3.1 scored 44.7%. This saturation highlights a growing divergence between what benchmarks can measure and how models perform on real-world tasks.
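The figures in the summary can be put on a common scale with a little arithmetic. The sketch below is illustrative only: it uses just the numbers quoted above, treats "around 93%" as an approximate MMLU ceiling, and introduces its own variable and function names (`horizons_min`, `doublings`, `headroom`), none of which come from the original article.

```python
import math

# Time horizons as quoted in the summary, converted to minutes.
horizons_min = {
    "GPT-3.5": 0.5,
    "GPT-4": 4,
    "o1": 40,
    "GPT-5": 180,
    "Claude Opus 4.6": 720,
}


def doublings(start_min, end_min):
    """Number of times start_min must double to reach end_min."""
    return math.log2(end_min / start_min)


# Doublings between each pair of neighboring models in the list.
models = list(horizons_min)
step_doublings = {
    f"{prev} -> {cur}": doublings(horizons_min[prev], horizons_min[cur])
    for prev, cur in zip(models, models[1:])
}

# Doublings across the whole range (30 seconds to 12 hours).
total_doublings = doublings(
    horizons_min["GPT-3.5"], horizons_min["Claude Opus 4.6"]
)

# Confidence interval quoted for Claude Opus 4.6: 5 to 66 hours.
ci_low_h, ci_high_h = 5, 66
ci_ratio = ci_high_h / ci_low_h

# MMLU: points left below the approximate ceiling (assumed to be 93.0).
mmlu_ceiling = 93.0
mmlu_scores = {"GPT-3 (2020)": 43.9, "GPT-4.1 (2025)": 90.2}
headroom = {
    name: round(mmlu_ceiling - score, 1) for name, score in mmlu_scores.items()
}

for step, d in step_doublings.items():
    print(f"{step}: {d:.2f} doublings")
print(f"Total: {total_doublings:.2f} doublings")
print(f"Opus 4.6 interval spans a factor of {ci_ratio:.1f}")
print(f"MMLU headroom (points): {headroom}")
```

Read this way, the interval quoted for Claude Opus 4.6 spans a factor of about 13, or roughly 3.7 doublings, which is wider than any single step between neighboring models in the list. On MMLU, GPT-4.1 sits less than 3 points below the approximate ceiling, leaving little room to distinguish stronger models.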

Key takeaway

If you are an AI scientist evaluating frontier models or designing new benchmarks, recognize that traditional metrics like MMLU are saturating, and that even advanced benchmarks like METR's become noisy and costly as models approach human-level performance on their hardest tasks. Shift your focus toward new evaluation paradigms that can reliably assess AI capabilities on complex, multi-week, real-world tasks, rather than relying solely on current, increasingly limited benchmarks. Doing so guards against misreading progress and keeps model comparisons meaningful.

Key insights

AI performance measurement faces increasing challenges from benchmark saturation and the difficulty of evaluating complex, real-world tasks.

Principles

Method

METR measures AI model capabilities by quantifying the human programmer time required to complete specific software engineering tasks, ranging from seconds to hours, to establish a comparative difficulty scale.
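One way to turn such a difficulty scale into a single number per model is a "time horizon": the human task length at which the model succeeds about half the time. The sketch below is a minimal illustration of that idea, not METR's implementation. It assumes a logistic fit of success against the log of human completion time, which the summary above does not specify. The task records are made up for illustration, and every name in it (`fit_logistic`, `time_horizon_minutes`, `toy_runs`) is hypothetical.

```python
import math


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon_minutes(runs):
    """Estimate the human task length (minutes) at which success probability is 50%.

    `runs` is a list of (human_minutes, succeeded) pairs for one model.
    """
    log_times = [math.log2(minutes) for minutes, _ in runs]
    outcomes = [1.0 if ok else 0.0 for _, ok in runs]

    # Center the log-times so the gradient ascent is better conditioned.
    mean_x = sum(log_times) / len(log_times)
    centered = [x - mean_x for x in log_times]

    a, b = fit_logistic(centered, outcomes)

    # sigmoid(a + b * x) = 0.5 exactly where a + b * x = 0.
    x_half = mean_x - a / b
    return 2.0 ** x_half


# Made-up example records (NOT real benchmark data): the model solves most
# short tasks and fails most long ones, with some overlap in the middle.
toy_runs = [
    (1, True), (2, True), (4, True), (8, True), (15, True),
    (30, False), (60, True), (120, False), (240, False), (480, False),
]

horizon = time_horizon_minutes(toy_runs)
print(f"Estimated 50% time horizon: {horizon:.0f} minutes")
```

With only a handful of tasks near the 50% point, small changes in the outcomes move the estimate a lot. This is one intuition for why intervals widen once a model solves nearly everything in the task set: few tasks remain that are hard enough to pin down where its success rate falls off.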

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Understanding AI.