Claude Opus 4.8: Capabilities and Reactions
Summary
Claude Opus 4.8, Anthropic's latest model, builds on Opus 4.7 with advances in coding, honesty, and general intelligence. Its SWE-bench Pro score rises to 69.2 from 64.3, its self-assessments are more honest, and pricing is unchanged at $5/$25 per million input/output tokens. New capabilities include a user-controlled "effort level" in chat, a faster research preview mode (2.5x speed for $10/$50), and dynamic workflows in Claude Code for complex tasks. Official benchmarks show modest to substantial gains across domains, including USAMO 2026 (96.7% vs. 69.3% for Opus 4.7) and AutomationBench (15.5% vs. 10%).
User reports, however, point to regressions. Opus 4.8 performs worse in adversarial business scenarios, struggles with negotiation, and can be overly harsh or equivocal. It is also vulnerable to prompt injections and can be jailbroken by other AI agents. Despite these drawbacks, many consider it the best current model for writing and knowledge work, though its "Max" effort setting can produce verbose, unhelpful internal monologues.
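The pricing figures in the summary can be turned into a quick cost comparison. The following is a minimal sketch, assuming the faster mode's $10/$50 uses the same per-million input/output units as the standard $5/$25. The names `Pricing`, `STANDARD`, `FAST`, and `cost_usd`, as well as the example token counts, are hypothetical and not part of any Anthropic SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD price per million tokens (hypothetical helper type)."""
    input_per_mtok: float
    output_per_mtok: float


# Prices as quoted in the summary; the per-million unit for FAST is an assumption.
STANDARD = Pricing(input_per_mtok=5.0, output_per_mtok=25.0)
FAST = Pricing(input_per_mtok=10.0, output_per_mtok=50.0)


def cost_usd(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one request under the given pricing."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative request size, not taken from the article.
    tokens_in, tokens_out = 200_000, 40_000
    print(cost_usd(STANDARD, tokens_in, tokens_out))  # 2.0
    print(cost_usd(FAST, tokens_in, tokens_out))      # 4.0
```

Under these assumptions the faster mode costs twice as much per token for the quoted 2.5x speed, so whether it pays off depends on how much latency matters for the workload.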
Key takeaway
For AI engineers and developers selecting a large language model, Claude Opus 4.8 is a powerful option for coding and knowledge work, with notable honesty improvements and dynamic workflows. However, evaluate it carefully before using it for adversarial interactions, negotiations, or creative writing, where its tendency toward harshness and equivocation can be counterproductive. Consider adjusting the effort level, and be wary of the "Max" setting, which may produce verbose or unhelpful output.
Key insights
Assessing an AI model accurately requires diverse benchmarks, calibrated user feedback, and an understanding of model "welfare".
Principles
- Combine official and third-party benchmarks for robust evaluation.
- Calibrate user reactions to filter noise and identify patterns.
- Honesty improvements are critical for AI trustworthiness.
Method
Evaluate new models by integrating dozens of benchmarks, model card tests, welfare information, and calibrated user reactions to discern consistent capability patterns.
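One small piece of this method, comparing scores across model versions, can be sketched as follows. This is an illustrative sketch only, not the original author's procedure. It uses just the three score pairs quoted in the summary, and the names `QUOTED_SCORES`, `gains`, and `summarize` are hypothetical.

```python
# Score pairs quoted in the summary: (Opus 4.7, Opus 4.8).
QUOTED_SCORES = {
    "SWE-bench Pro": (64.3, 69.2),
    "USAMO 2026": (69.3, 96.7),
    "AutomationBench": (10.0, 15.5),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = new - old
    relative = absolute / old * 100
    return round(absolute, 1), round(relative, 1)


def summarize(scores: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Map each benchmark name to its (absolute, relative) gain."""
    return {name: gains(old, new) for name, (old, new) in scores.items()}


if __name__ == "__main__":
    for name, (absolute, relative) in summarize(QUOTED_SCORES).items():
        print(f"{name}: +{absolute} points ({relative}% relative)")
```

Reporting both numbers matters: AutomationBench shows the largest relative gain (55%) from a 5.5-point change on a low base, while USAMO 2026 shows the largest absolute gain (27.4 points). Neither figure captures the user-reported regressions, which is why the method pairs benchmarks with calibrated user reactions.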
In practice
- Utilize Claude Code's dynamic workflows for complex projects.
- Adjust model effort levels to optimize performance for specific tasks.
- Avoid using AI editors for creative writing, to preserve a distinct voice.
Topics
- Claude Opus 4.8
- Large Language Models
- AI Benchmarking
- Code Generation
- Dynamic Workflows
- Model Honesty
- Prompt Engineering
Best for: NLP Engineer, AI Architect, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Don't Worry About the Vase.