AI benchmarks are broken. Here’s what we need instead.
Summary
Current AI benchmarks mostly compare AI performance against individual humans on isolated tasks, so they fail to capture AI's real impact within complex organizational workflows. This mismatch leads to misunderstood AI capabilities, overlooked systemic risks, and misjudged economic and social consequences. Angela Aristidou, a professor at University College London, proposes a new framework called Human–AI, Context-Specific (HAIC) benchmarks. The approach makes four shifts: it evaluates team and workflow performance rather than individual task performance, extends the time horizon from one-off tests to long-term impacts, broadens outcome measures to include organizational outcomes and coordination quality, and accounts for upstream and downstream system effects. Examples from healthcare, such as FDA-approved medical AI models, illustrate how high benchmark scores can fail to translate into real-world productivity gains once complex multidisciplinary team interactions and regulatory requirements come into play.
Key takeaway
If you are an AI Product Manager evaluating models for real-world deployment, move beyond isolated task-based benchmarks. Prioritize HAIC benchmarks that assess AI's performance within human teams and workflows over extended periods, focusing on organizational outcomes and system-level impacts. Doing so will help you avoid the "AI graveyard" of abandoned models and build greater organizational confidence in AI.
Key insights
Current AI benchmarks misrepresent real-world performance by ignoring human teams and organizational context.
Principles
- AI performance emerges over extended use within human teams.
- System-level effects outweigh task-level accuracy in high-stakes contexts.
Method
HAIC benchmarks reframe evaluation by shifting the unit of analysis to team/workflow performance, expanding the time horizon, broadening outcome measures, and considering system effects.
In practice
- Assess AI's influence on collective reasoning and coordination.
- Evaluate AI systems longitudinally, including how readily the team can detect their errors.
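The evaluation dimensions described above can be made concrete with a small sketch. The article does not specify a data format or formulas, so everything here is an assumption for illustration: the `WorkflowObservation` record, its field names, and the `error_detectability` and `summarize` helpers are hypothetical, and the sample numbers are made up rather than taken from any study. The idea is only to show what it might look like to log team-level, multi-week outcomes alongside an isolated task score.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class WorkflowObservation:
    """One observation period for a human-AI team (all fields are hypothetical)."""
    week: int
    task_accuracy: float            # isolated-task score of the model, 0..1
    cases_completed: int            # team-level throughput for the period
    errors_made: int                # AI errors that occurred in the workflow
    errors_caught: int              # of those, errors the team detected
    downstream_rework_hours: float  # effort other teams spent fixing issues


def error_detectability(observations: list[WorkflowObservation]) -> Optional[float]:
    """Share of AI errors the team caught across all periods (None if no errors)."""
    total_made = sum(o.errors_made for o in observations)
    if total_made == 0:
        return None
    return sum(o.errors_caught for o in observations) / total_made


def summarize(observations: list[WorkflowObservation]) -> dict:
    """Aggregate task-level and workflow-level measures over the observed weeks."""
    if not observations:
        raise ValueError("at least one observation is required")
    ordered = sorted(observations, key=lambda o: o.week)
    return {
        "weeks_observed": len(ordered),
        "mean_task_accuracy": mean(o.task_accuracy for o in ordered),
        "mean_cases_completed": mean(o.cases_completed for o in ordered),
        "error_detectability": error_detectability(ordered),
        "total_rework_hours": sum(o.downstream_rework_hours for o in ordered),
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the article.
    log = [
        WorkflowObservation(1, 0.9, 10, 4, 3, 1.5),
        WorkflowObservation(2, 0.9, 20, 6, 2, 2.5),
    ]
    for name, value in summarize(log).items():
        print(f"{name}: {value}")
```

In a record like this, the task accuracy can stay flat while throughput, error detectability, and downstream rework change from week to week, which is the kind of gap between task-level and workflow-level measures the summary describes.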
Topics
- AI Benchmarking
- Human-AI Interaction
- Context-Specific Evaluation
- Organizational Workflows
- HAIC Benchmarks
Best for: Research Scientist, AI Product Manager, AI Scientist, Director of AI/ML, Consultant
Editorial summary, takeaway, and curation by AIssential. Original article published by MIT Technology Review.