Claude Opus 4.8: Capabilities and Reactions

Source: Don't Worry About the Vase · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Claude Opus 4.8, Anthropic's latest model, builds on Opus 4.7 with gains in coding, honesty, and general intelligence. Its SWE-bench Pro score rises from 64.3 to 69.2, it is more honest when assessing its own work, and pricing is unchanged at $5/$25 per million input/output tokens. New features include a user-controlled "effort level" in chat, a faster research preview mode (2.5x speed for $10/$50 per million input/output tokens), and dynamic workflows in Claude Code for complex tasks. Official benchmark gains range from modest to substantial, for example USAMO 2026 (96.7% vs 69.3% for Opus 4.7) and AutomationBench (15.5% vs 10%). User reports, however, point to regressions: Opus 4.8 performs worse in adversarial business scenarios, struggles with negotiation, and can be overly harsh or equivocal. It is also vulnerable to prompt injections and can be jailbroken by other AI agents. Despite these drawbacks, many consider it the best current model for writing and knowledge work, although the "Max" effort setting can produce verbose, unhelpful internal monologues.
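The pricing figures above can be turned into a quick per-request cost estimate. The following is a minimal sketch, assuming the quoted prices are in USD per million tokens; the function name, the mode labels, and the token counts in the usage note are hypothetical and not part of any Anthropic API.

```python
# Prices in USD per million tokens, as quoted in the summary.
# The keys "standard" and "fast" are illustrative labels, not API parameters.
PRICES = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},  # research preview mode
}


def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Return the estimated cost in USD for one request in the given mode."""
    if mode not in PRICES:
        raise ValueError(f"unknown mode: {mode!r}")
    rate = PRICES[mode]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 200k input tokens, 50k output tokens.
    for mode in PRICES:
        print(mode, estimate_cost(200_000, 50_000, mode))
```

On these figures the fast mode costs twice as much per token for a quoted 2.5x speedup, so the choice depends on whether latency or spend matters more for the workload.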

Key takeaway

For AI engineers and developers choosing a large language model, Claude Opus 4.8 is a strong option for coding and knowledge work, with notable honesty improvements and dynamic workflows. Evaluate it carefully before using it for adversarial interactions, negotiation, or creative writing, where its tendency toward harshness and equivocation can be counterproductive. Adjust the effort level to the task, and be wary of the "Max" setting, which may produce verbose or unhelpful output.

Key insights

Comprehensive AI model evaluation requires diverse benchmarks, calibrated user feedback, and attention to model "welfare" information.

Principles

Method

Evaluate new models by integrating dozens of benchmarks, model card tests, welfare information, and calibrated user reactions to discern consistent capability patterns.

In practice
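One way to start applying the method is to put the reported benchmark scores side by side and rank them by gain before weighing them against user reports. The following is a minimal sketch that covers only the benchmark side, using the three score pairs quoted in the summary. The class and function names are hypothetical, and the user-reported regressions are not encoded because the summary gives no numbers for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    previous: float  # Opus 4.7 score, as quoted in the summary
    current: float   # Opus 4.8 score, as quoted in the summary

    @property
    def absolute_gain(self) -> float:
        """Gain in score points."""
        return self.current - self.previous

    @property
    def relative_gain(self) -> float:
        """Gain as a fraction of the previous score."""
        return self.absolute_gain / self.previous


RESULTS = [
    BenchmarkResult("SWE-bench Pro", 64.3, 69.2),
    BenchmarkResult("USAMO 2026", 69.3, 96.7),
    BenchmarkResult("AutomationBench", 10.0, 15.5),
]


def summarize(results: list[BenchmarkResult]) -> list[tuple[str, float, float]]:
    """Return (name, absolute gain, relative gain in %) sorted by relative gain."""
    ranked = sorted(results, key=lambda r: r.relative_gain, reverse=True)
    return [
        (r.name, round(r.absolute_gain, 1), round(100 * r.relative_gain, 1))
        for r in ranked
    ]


if __name__ == "__main__":
    for name, abs_gain, rel_gain in summarize(RESULTS):
        print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

Ranking by relative gain is only one possible choice. A small absolute gain on a low baseline, as with AutomationBench, looks large in relative terms, which is one reason to read such numbers alongside qualitative user feedback.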

Topics

Best for: NLP Engineer, AI Architect, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Don't Worry About the Vase.