Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

2026-06-10 · Source: VentureBeat · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Intermediate, short

Summary

The University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI), together with more than 300 domain experts, has launched Agents' Last Exam (ALE), a benchmark that measures how well AI agents execute economically valuable, long-horizon professional workflows. OpenAI's GPT-5.5 (April release), running in the Codex harness, took the top spot on the ALE leaderboard with a 24.0% pass rate, narrowly ahead of Anthropic's new Claude Fable 5 at 22.0%. ALE targets known flaws in earlier benchmarks by using a Generalist Computer-Use Agent (GCUA) framework, in which agents must navigate Linux/Windows VMs using both shell scripting and point-and-click operations. It features 1,490 tasks, scaling to 5,000, drawn from 55 non-physical industry sub-domains based on the U.S. federal occupational taxonomy. To limit contamination, more than 1,300 of the tasks are kept private and rotated.

Key takeaway

For AI Scientists and Machine Learning Engineers developing agentic systems, Agents' Last Exam (ALE) is a reality check: the leading model passes only 24.0% of its tasks. Prioritize agents that can handle complex, multi-modal interactions across diverse software environments rather than optimizing for narrow, text-based benchmarks. Focus on robust workflow execution and contamination-resistant evaluation when building agents intended for production.

Key insights

Agents' Last Exam (ALE) evaluates AI agents on real-world, long-horizon professional workflows and exposes how limited current models still are: the top score is a 24.0% pass rate.

Principles

Benchmarks must prevent "cheating" by models.
Real-world tasks require multi-modal interaction.
Evaluation data needs systematic rotation to avoid contamination.

Method

ALE runs agents inside a Generalist Computer-Use Agent (GCUA) framework: they must navigate Linux/Windows VMs and interleave shell scripting with point-and-click operations. Most tasks are graded with deterministic, code-based evaluation.
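
As a rough illustration of what deterministic, code-based grading can look like, the Python sketch below checks an agent's output against a known answer instead of asking a model to judge it. Everything in it is hypothetical and not taken from ALE: the task (totalling an invoice CSV), the `amount` column, and the function names `grade_invoice_total` and `pass_rate`. Amounts are assumed to be non-negative decimals with at most two decimal places.

```python
import csv
import io


def grade_invoice_total(csv_text: str, expected_total_cents: int) -> bool:
    """Hypothetical checker: pass only if the CSV amounts sum to the expected total."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total_cents = 0
    for row in reader:
        # Work in integer cents so the comparison is exact (no float rounding).
        dollars, _, cents = row["amount"].strip().partition(".")
        total_cents += int(dollars) * 100 + int((cents + "00")[:2])
    return total_cents == expected_total_cents


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks whose checker returned True."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

The same input always produces the same verdict, which is the property that distinguishes this style of grading from an LLM-as-a-judge setup. For instance, `grade_invoice_total("item,amount\na,10.50\nb,2.25\n", 1275)` returns `True`, and changing the expected total to `1276` returns `False`.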

In practice

Test AI agents on multi-step, cross-application tasks.
Implement rotating private datasets for benchmark integrity.
Prioritize deterministic grading over LLM-as-a-judge.
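
One possible way to implement a rotating private split is sketched below in Python. It is an assumption for illustration only; the source does not describe how ALE selects or rotates its private tasks. The names `is_private` and `split_tasks`, the `rotation` counter, and the default `private_fraction` are all hypothetical. Each task is assigned to the private or public pool by hashing its ID together with the rotation number, so the split is reproducible within a rotation and reshuffles when the rotation number changes.

```python
import hashlib


def is_private(task_id: str, rotation: int, private_fraction: float = 0.9) -> bool:
    """Hypothetical rule: hash (rotation, task_id) into [0, 1) and threshold it."""
    digest = hashlib.sha256(f"{rotation}:{task_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < private_fraction


def split_tasks(task_ids, rotation, private_fraction=0.9):
    """Return (public, private) task ID lists for one rotation."""
    public, private = [], []
    for task_id in task_ids:
        if is_private(task_id, rotation, private_fraction):
            private.append(task_id)
        else:
            public.append(task_id)
    return public, private
```

Calling `split_tasks(ids, rotation=1)` and later `split_tasks(ids, rotation=2)` gives two different partitions of the same task list. A hash-based rule avoids storing the assignment separately, at the cost of only approximating the target fraction on small task sets.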

Topics

AI Benchmarking
Agentic AI
GPT-5.5
Claude Fable 5
Workflow Automation
Benchmark Contamination
Generalist Computer-Use Agent

Best for: CTO, VP of Engineering/Data, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by VentureBeat.