New benchmark exposes how badly AI struggles with real knowledge work
Summary
The new AA-Briefcase benchmark from Artificial Analysis shows that even leading AI models struggle with realistic knowledge work. The benchmark simulates multi-week projects built from thousands of fragmented source files such as Slack threads, emails, and data exports. Claude Fable 5, the top performer, achieved the highest rubric pass rate but met all criteria on only 3 percent of tasks. On 31 of 91 tasks, no model surpassed a 50 percent success rate. Weaker models typically fail on basic execution errors, while stronger models miss subtle details that require synthesizing information from multiple sources. Per-task costs also vary widely, from approximately \$0.04 for DeepSeek V4 Flash to over \$31 for Claude Fable 5, a difference of roughly 800x.
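The cost gap quoted in the summary can be checked with simple arithmetic. The sketch below is illustrative only: it uses the two per-task figures from the summary, treats "over \$31" as a lower bound, and assumes (this is not stated in the source) that the per-task figure is an average that can be multiplied across all 91 tasks. The function and constant names are made up for this example.

```python
# Per-task costs in USD as quoted in the summary; both are approximate.
COST_DEEPSEEK_V4_FLASH = 0.04  # "approximately $0.04"
COST_CLAUDE_FABLE_5 = 31.00    # "over $31", treated here as a lower bound
NUM_TASKS = 91                 # task count implied by "31 out of 91 tasks"


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive model costs per task."""
    if cheap <= 0:
        raise ValueError("cheap cost must be positive")
    return expensive / cheap


def suite_cost(per_task: float, tasks: int = NUM_TASKS) -> float:
    """Estimate the cost of one full run, assuming per_task is an average."""
    return per_task * tasks


ratio = cost_ratio(COST_CLAUDE_FABLE_5, COST_DEEPSEEK_V4_FLASH)
print(f"Per-task cost ratio: about {ratio:.0f}x")
print(f"Full run, cheaper model: about ${suite_cost(COST_DEEPSEEK_V4_FLASH):.2f}")
print(f"Full run, pricier model: at least ${suite_cost(COST_CLAUDE_FABLE_5):.2f}")
```

With the quoted figures the ratio comes out near 775x, which the summary rounds to 800x. Because the higher price is only given as "over \$31", the true ratio may be somewhat larger.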
Key takeaway
AI Scientists and Machine Learning Engineers deploying models for complex knowledge work should evaluate model capabilities beyond basic task completion. Even top-tier models such as Claude Fable 5 fully solve only 3 percent of realistic multi-source tasks, which points to a significant gap in advanced reasoning and synthesis. Prioritize benchmarks that simulate fragmented, multi-week projects to assess performance accurately and to set realistic expectations about AI's current limits in deep knowledge integration.
Key insights
Even top AI models fail at complex knowledge work, fully solving only 3 percent of tasks on the new AA-Briefcase benchmark.
Principles
- AI struggles with multi-source information synthesis.
- Failure modes shift with capability: weaker models make execution errors, stronger models miss subtle details.
- Significant cost variance exists across AI models.
In practice
- Evaluate AI for tasks needing deep information synthesis.
- Consider cost-performance trade-offs for knowledge work.
- Anticipate different failure modes across model tiers.
Topics
- AI Benchmarking
- Knowledge Work Automation
- Claude Fable 5
- AI Model Performance
- Cost-Performance Analysis
- Information Synthesis
Best for: AI Architect, Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.