The Real Costs of AI Agents Doing Human Jobs
Summary
The "Agents Last Exam," a benchmark published on June 3, 2026, by UC Berkeley and more than 250 institutions, measures the economic impact and industrial relevance of AI agents across 13 industry clusters and nearly 1,500 tasks, rather than relying on theoretical LLM tests. Codex GPT 5.5 leads overall with a 26.2% pass rate, followed by OpenClaw GPT 5.5 at 22.8%. Runtime and token costs vary widely: Codex GPT 5.5 ran for 81 hours, while OpenClaw DeepSeek v4 Pro took 235 hours. Failure analysis for Claude Code Opus 4.7 identified "wrong strategy" (30%) and "incompleteness" (17%) as the dominant failure modes. On frontier-difficulty tasks, GPT 5.5 agents such as Codex GPT 5.5 (8.6%) clearly outperformed Claude Code Opus 4.7 (0%). Separately, Google introduced a new Quantization Aware Training (QAT) methodology for Gemma 4 12B that reduces the model from 26.7 GB (BF16) to 8 GB (4-bit Q4_0) with improved quality.
Key takeaway
Machine Learning Engineers evaluating AI agents for industrial deployment should prioritize GPT 5.5-based agents, which posted the highest pass rates on complex, economically relevant tasks, especially at frontier difficulty. Assess total cost of ownership carefully, weighing pass rates against the substantial runtime and token costs. Also consider Google's new Quantization Aware Training for Gemma 4 12B as a way to run high-quality models on consumer-grade hardware without severe performance degradation.
Key insights
The "Agents Last Exam" shows that current AI agents still struggle with industrial tasks (the best overall pass rate is 26.2%) and exposes large performance and cost disparities between agents.
Principles
- Economic relevance requires industrial-grade benchmarks.
- Agent performance varies significantly by model and task difficulty.
- Quantization Aware Training cuts model size while preserving quality.
Method
The Agents Last Exam benchmark uses nearly 1,500 task instances across 13 industry clusters, evaluating AI agents on pass rate, runtime, token cost, and failure modes to determine economic impact.
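One way to combine pass rate and runtime into a single cost figure is runtime per passed task. The sketch below is illustrative only and is not the benchmark's own scoring code: the `AgentRun` class and `hours_per_passed_task` function are hypothetical names, the task count is rounded up to 1,500, and it assumes the quoted 81 hours covers a run over the full task set.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Aggregate result of one agent on a benchmark run (hypothetical structure)."""
    name: str
    pass_rate: float      # fraction of tasks passed, e.g. 0.262 for 26.2%
    runtime_hours: float  # total runtime of the run, in hours


def hours_per_passed_task(run: AgentRun, num_tasks: int) -> float:
    """Return runtime spent per successfully completed task."""
    passed = run.pass_rate * num_tasks
    if passed <= 0:
        return float("inf")  # no passes: cost per pass is unbounded
    return run.runtime_hours / passed


# Figures quoted in the summary; NUM_TASKS rounds "nearly 1,500" up (assumption).
NUM_TASKS = 1500
codex = AgentRun("Codex GPT 5.5", pass_rate=0.262, runtime_hours=81.0)

print(f"{codex.name}: {hours_per_passed_task(codex, NUM_TASKS):.3f} h per passed task")
```

The same function can be applied to any agent for which both a pass rate and a runtime are reported. Token cost could be folded in the same way by dividing total token spend by the number of passed tasks.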
In practice
- Prioritize GPT 5.5 models for frontier industrial tasks.
- Evaluate agent runtime and token costs for deployment.
- Consider QAT for efficient model deployment on consumer hardware.
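To judge whether a quantized model fits on consumer hardware, it helps to work through the size arithmetic. The sketch below uses only the 26.7 GB (BF16) and 8 GB (4-bit) sizes quoted in the summary. The function names are hypothetical, and the bits-per-weight estimate assumes that every parameter is stored at 16 bits in the baseline and that file size scales linearly with bits per weight.

```python
def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """How many times smaller the quantized model is."""
    return original_gb / quantized_gb


def effective_bits_per_weight(
    original_gb: float, quantized_gb: float, original_bits: float = 16.0
) -> float:
    """Average bits stored per weight after quantization.

    Assumes the baseline stores every weight at `original_bits` bits.
    """
    return original_bits * quantized_gb / original_gb


BF16_GB = 26.7  # size quoted in the summary
Q4_GB = 8.0     # size quoted in the summary

ratio = compression_ratio(BF16_GB, Q4_GB)
bits = effective_bits_per_weight(BF16_GB, Q4_GB)

print(f"compression: {ratio:.2f}x, effective bits per weight: {bits:.2f}")
```

Under these assumptions the quoted sizes work out to roughly a 3.3x reduction and a little under 5 bits per weight. The gap above a nominal 4 bits may reflect per-block scale factors or tensors kept at higher precision, though the summary does not say.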
Topics
- AI Agent Benchmarking
- Industrial AI Applications
- Large Language Models
- Model Quantization
- GPT 5.5
- Claude Opus
- Economic Impact of AI
Best for: AI Engineer, AI Architect, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.