New benchmark exposes how badly AI struggles with real knowledge work

2026-06-19 · Source: The Decoder · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

The new AA-Briefcase benchmark from Artificial Analysis shows that even leading AI models struggle with realistic knowledge work. The benchmark simulates multi-week projects using thousands of fragmented source files such as Slack threads, emails, and data exports. Claude Fable 5, the top performer, achieved the highest rubric pass rate but met all criteria on only 3 percent of tasks. On 31 of 91 tasks, no model surpassed a 50 percent success rate. Weaker models typically fail on basic execution errors, while stronger models miss subtle details that require synthesizing information from multiple sources. Prices also vary widely: per-task costs range from approximately $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5, a difference of roughly 800x.
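The cost figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the summary, which are themselves approximate ("approximately $0.04", "over $31"). The function names are hypothetical, and "cost per fully solved task" is a derived metric introduced here for illustration, not one reported by the benchmark.

```python
# Figures quoted in the summary; both prices are approximate.
CHEAPEST_COST_PER_TASK = 0.04  # DeepSeek V4 Flash, USD ("approximately")
PRICIEST_COST_PER_TASK = 31.0  # Claude Fable 5, USD ("over", so a lower bound)
FULL_SOLVE_RATE = 0.03         # share of tasks where Claude Fable 5 met all criteria


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive model costs per task."""
    if cheap <= 0:
        raise ValueError("cheap must be positive")
    return expensive / cheap


def cost_per_full_solve(cost_per_task: float, full_solve_rate: float) -> float:
    """Return expected spend per fully solved task (illustrative derived metric)."""
    if full_solve_rate <= 0:
        raise ValueError("full_solve_rate must be positive")
    return cost_per_task / full_solve_rate


if __name__ == "__main__":
    ratio = cost_ratio(PRICIEST_COST_PER_TASK, CHEAPEST_COST_PER_TASK)
    per_solve = cost_per_full_solve(PRICIEST_COST_PER_TASK, FULL_SOLVE_RATE)
    print(f"Per-task cost ratio: about {ratio:.0f}x")
    print(f"Spend per fully solved task (Claude Fable 5): about ${per_solve:,.0f}")
```

With the quoted inputs the ratio comes out near 775x, which is consistent with the rounded 800x figure given that $31 is a lower bound. The second number is only a rough indication: it assumes cost is spread evenly across tasks, which the summary does not state.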

Key takeaway

AI scientists and machine learning engineers deploying models for complex knowledge work should evaluate capabilities beyond basic task completion. Even the top-tier model, Claude Fable 5, fully solves only 3% of realistic multi-source tasks, which points to a significant gap in advanced reasoning and synthesis. Prioritize benchmarks that simulate fragmented, multi-week projects to assess performance accurately and to set realistic expectations about AI's current limits in deep knowledge integration.

Key insights

Even top AI models fail at complex knowledge work, fully solving only 3% of tasks on a new benchmark.

Principles

AI struggles with multi-source information synthesis.
Failure modes shift with capability: weaker models make basic execution errors, stronger models miss subtle cross-source details.
Significant cost variance exists across AI models.

In practice

Test models on tasks that require deep, multi-source information synthesis before relying on them.
Consider cost-performance trade-offs for knowledge work.
Anticipate different failure modes across model tiers.

Topics

AI Benchmarking
Knowledge Work Automation
Claude Fable 5
AI Model Performance
Cost-Performance Analysis
Information Synthesis

Best for: AI Architect, Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML


Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.