AI benchmarks are broken. Here’s what we need instead.

2026-03-31 · Source: MIT Technology Review · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Advanced, medium

Summary

Current AI benchmarking methods, which primarily compare AI performance against individual humans on isolated tasks, fail to capture AI's true impact within complex organizational workflows. This misalignment leads to misunderstood AI capabilities, overlooked systemic risks, and misjudged economic and social consequences. Angela Aristidou, a professor at University College London, proposes a new framework called Human–AI, Context-Specific (HAIC) benchmarks. The approach makes four shifts: it moves evaluation from individual task performance to team and workflow performance, extends the time horizon from one-off tests to long-term impacts, broadens outcome measures to include organizational outcomes and coordination quality, and accounts for upstream and downstream system effects. Examples from healthcare, such as FDA-approved medical AI models, illustrate how high benchmark scores can fail to translate into real-world productivity gains once complex multidisciplinary team interactions and regulatory requirements come into play.

Key takeaway

If you are an AI Product Manager evaluating models for real-world deployment, move beyond isolated task-based benchmarks. Prioritize HAIC benchmarks that assess AI's performance within human teams and workflows over extended periods, focusing on organizational outcomes and system-level impacts. Doing so will help you avoid the "AI graveyard" of abandoned models and build greater organizational confidence in AI.

Key insights

Current AI benchmarks misrepresent real-world performance by ignoring human teams and organizational context.

Principles

AI performance emerges over extended use within human teams.
System-level effects outweigh task-level accuracy in high-stakes contexts.

Method

HAIC benchmarks reframe evaluation by shifting the unit of analysis to team/workflow performance, expanding the time horizon, broadening outcome measures, and considering system effects.
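
As a rough illustration of what those four shifts could look like in an evaluation harness, the sketch below records outcomes per team workflow and per week instead of per isolated task. Everything in it is an assumption made for illustration: the WorkflowObservation schema, the field names, the combined error-rate formula, and the sample numbers are hypothetical and do not come from the HAIC proposal or the original article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WorkflowObservation:
    """One observation period for a team workflow (hypothetical schema)."""
    week: int               # time horizon: repeated periods, not a one-off test
    with_ai: bool           # whether the team used the AI system that week
    cases_resolved: int     # organizational outcome (assumed to be > 0)
    handoff_errors: int     # illustrative proxy for coordination quality
    downstream_rework: int  # illustrative proxy for downstream system effects


def error_rate(obs: WorkflowObservation) -> float:
    """Coordination and downstream errors per resolved case."""
    return (obs.handoff_errors + obs.downstream_rework) / obs.cases_resolved


def compare_over_time(observations: list[WorkflowObservation]) -> dict[bool, float]:
    """Mean weekly error rate for each condition (with AI, without AI)."""
    result = {}
    for flag in (True, False):
        rates = [error_rate(o) for o in observations if o.with_ai == flag]
        if rates:  # skip a condition that has no observations
            result[flag] = mean(rates)
    return result


# Made-up numbers, for illustration only.
sample = [
    WorkflowObservation(week=1, with_ai=True, cases_resolved=100,
                        handoff_errors=5, downstream_rework=5),
    WorkflowObservation(week=2, with_ai=True, cases_resolved=100,
                        handoff_errors=10, downstream_rework=10),
    WorkflowObservation(week=1, with_ai=False, cases_resolved=100,
                        handoff_errors=10, downstream_rework=10),
    WorkflowObservation(week=2, with_ai=False, cases_resolved=100,
                        handoff_errors=10, downstream_rework=10),
]
summary = compare_over_time(sample)
```

Here the unit of analysis is the workflow observation, so a model with high task-level accuracy could still score poorly if handoff errors or downstream rework rise over the weeks it is in use.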

In practice

Assess AI's influence on collective reasoning and coordination.
Evaluate AI systems longitudinally for error detectability.

Topics

AI Benchmarking
Human-AI Interaction
Context-Specific Evaluation
Organizational Workflows
HAIC Benchmarks

Best for: Research Scientist, AI Product Manager, AI Scientist, Director of AI/ML, Consultant

Editorial summary, takeaway, and curation by AIssential. Original article published by MIT Technology Review.