Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark
Summary
The Center for Responsible, Decentralized Intelligence (RDI) at the University of California, Berkeley, working with more than 300 domain experts, has launched Agents' Last Exam (ALE), a benchmark that measures whether AI agents can execute economically valuable, long-horizon professional workflows. OpenAI's GPT-5.5 (April release), running in the Codex harness, took the top spot on the ALE Leaderboard with a 24.0% pass rate, narrowly ahead of Anthropic's new Claude Fable 5 at 22.0%. To address flaws in earlier benchmarks, ALE uses a Generalist Computer-Use Agent (GCUA) framework that requires agents to navigate Linux and Windows VMs using both shell scripting and point-and-click operations. It features 1,490 tasks, scaling to 5,000, drawn from 55 non-physical industry sub-domains based on the U.S. federal occupational taxonomy. To combat contamination, the benchmark keeps more than 1,300 tasks private and rotates them.
Key takeaway
For AI scientists and machine learning engineers building agentic systems, Agents' Last Exam (ALE) is a reality check: the best-performing agent passes fewer than one in four tasks. Prioritize agents that can handle complex, multi-modal interaction across diverse software environments over agents tuned for narrow, text-based benchmarks. Focus on robust workflow execution and contamination-resistant evaluation to build agents that hold up in production.
Key insights
Agents' Last Exam (ALE) evaluates AI agents on real-world, long-horizon professional workflows and exposes how limited current models are: the top-ranked agent passes only 24.0% of tasks.
Principles
- Benchmarks must prevent "cheating" by models.
- Real-world tasks require multi-modal interaction.
- Evaluation data needs systematic rotation to avoid contamination.
Method
ALE places agents in a Generalist Computer-Use Agent (GCUA) framework: they must navigate Linux/Windows VMs, interleaving shell scripting with point-and-click operations. Most tasks are graded with deterministic, code-based evaluation.
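To make "deterministic, code-based evaluation" concrete, here is a minimal Python sketch of one possible grader. It is not ALE's actual grading code, which this summary does not describe. The function names (`grade_task`, `pass_rate`) and the idea of comparing output files by SHA-256 digest are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def grade_task(workdir: Path, expected: dict[str, str]) -> bool:
    """Hypothetical grader: pass only if every expected file exists
    in the agent's working directory with the expected digest."""
    for rel_path, digest in expected.items():
        target = workdir / rel_path
        if not target.is_file() or file_sha256(target) != digest:
            return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks passed; 0.0 for an empty result list."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

Because the check depends only on the files the agent leaves behind, the same run always gets the same grade, which is the property that separates this style of grading from LLM-as-a-judge scoring.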
In practice
- Test AI agents on multi-step, cross-application tasks.
- Implement rotating private datasets for benchmark integrity.
- Prioritize deterministic grading over LLM-as-a-judge.
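One way to implement a rotating private dataset is sketched below in Python. The summary does not say how ALE rotates its private tasks, so the whole scheme is an assumption: the function names, the `salt` parameter, and the epoch-based hashing are all hypothetical.

```python
import hashlib


def rotation_key(task_id: str, epoch: int, salt: str = "example-salt") -> int:
    """Hypothetical ranking key: hash the salt, epoch, and task id together."""
    digest = hashlib.sha256(f"{salt}:{epoch}:{task_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def active_tasks(private_ids: list[str], epoch: int, k: int) -> list[str]:
    """Select k private tasks for the given epoch.

    The choice is deterministic for a fixed epoch and salt, and it
    reshuffles whenever the epoch number changes.
    """
    ranked = sorted(private_ids, key=lambda t: rotation_key(t, epoch))
    return ranked[:k]
```

Keeping the salt secret and advancing `epoch` on a schedule means outsiders cannot predict which private tasks are live, while the benchmark operator can still reproduce any past selection.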
Topics
- AI Benchmarking
- Agentic AI
- GPT-5.5
- Claude Fable 5
- Workflow Automation
- Benchmark Contamination
- Generalist Computer-Use Agent
Best for: CTO, VP of Engineering/Data, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by VentureBeat.