Maybe AI agents can be lawyers after all
Summary
Anthropic's Opus 4.6 model has significantly improved AI agent performance on professional tasks, as measured by Mercor's APEX-Agents benchmark. Initially, AI agents scored under 25% on tasks in fields like law and corporate analysis, which led to the conclusion that human professionals were safe from immediate displacement. Opus 4.6, however, scored nearly 30% in one-shot trials and averaged 45% when given multiple attempts, up from the previous 18.4%. The improvement is attributed in part to new agentic features, including "agent swarms," designed for multistep problem-solving. Mercor CEO Brendan Foody noted the rapid progress, indicating that foundation model development continues at a fast pace.
Key takeaway
For CTOs and VPs of Engineering assessing AI integration into professional workflows, the rapid performance gains of models like Anthropic's Opus 4.6 on benchmarks like APEX-Agents signal a need to re-evaluate AI's potential impact. Your teams should begin exploring advanced agentic features, such as "agent swarms," to understand their applicability to complex, multistep professional tasks within your organization, rather than dismissing AI for such roles based on older benchmarks.
Key insights
AI agent performance on professional tasks is rapidly improving, challenging previous assumptions about human job security.
Principles
- AI capabilities evolve quickly
- Multistep problems benefit from agentic features
Method
Mercor's APEX-Agents benchmark measures AI agent performance on professional tasks in fields like law and corporate analysis, reporting results for both one-shot and multi-attempt trials.
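To make the difference between the two trial types concrete, here is a minimal sketch of how a one-shot score and a multi-attempt average could be computed from pass/fail results. It is an illustration only: the article does not describe Mercor's actual scoring, and the task names, data, and function names below are hypothetical.

```python
from statistics import mean

def one_shot_rate(attempts_by_task):
    """Fraction of tasks passed on the first attempt only."""
    return mean(1.0 if attempts[0] else 0.0
                for attempts in attempts_by_task.values())

def mean_attempt_rate(attempts_by_task):
    """Per-task pass rate over all attempts, averaged across tasks."""
    return mean(mean(1.0 if passed else 0.0 for passed in attempts)
                for attempts in attempts_by_task.values())

# Hypothetical pass/fail outcomes, four attempts per task.
# These are made-up values, not APEX-Agents data.
results = {
    "contract_review": [False, True, True, False],
    "merger_memo": [True, True, False, True],
    "market_sizing": [False, False, False, True],
    "due_diligence": [False, False, True, False],
}

print(f"one-shot: {one_shot_rate(results):.1%}")
print(f"multi-attempt average: {mean_attempt_rate(results):.1%}")
```

With this made-up data the multi-attempt average is higher than the one-shot score, because an agent that fails its first try can still succeed on later ones. This is one reason the two figures should not be compared directly.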
In practice
- Evaluate AI models that offer agentic features
- Monitor the APEX-Agents leaderboard
Topics
- AI Agents
- Anthropic Opus 4.6
- Mercor Benchmark
- Professional Tasks
- Foundation Models
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Product Manager, Tech Journalist, Legal Professional
Editorial summary, takeaway, and curation by AIssential. Original article published by AI News & Artificial Intelligence | TechCrunch.