Claude Opus 4.8: Capabilities and Reactions

2023-08-29 · Source: Don't Worry About the Vase · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

Claude Opus 4.8, Anthropic's latest model, demonstrates significant advancements in coding, honesty, and general intelligence, building upon Opus 4.7's strengths. It features improved SWE-bench Pro scores (69.2 from 64.3), enhanced honesty in self-assessment, and maintains the same pricing at \$5/\$25 per million input/output tokens. New capabilities include user-controlled "effort level" in chat, a faster research preview mode (2.5x speed for \$10/\$50), and dynamic workflows in Claude Code for complex tasks. While official benchmarks show modest to substantial gains across various domains like USAMO 2026 (96.7% vs 69.3% for 4.7) and AutomationBench (15.5% vs 10%), user experiences highlight regressions. Opus 4.8 exhibits reduced performance in adversarial business scenarios, struggles with negotiation, and can be overly harsh or equivocal. It also shows vulnerabilities to prompt injections and can be jailbroken by other AI agents. Despite these drawbacks, many consider it the current best model for writing and knowledge work, though its "Max" effort setting can lead to verbose, unhelpful internal monologues.

Key takeaway

For AI Engineers and developers selecting a large language model, Claude Opus 4.8 presents a powerful option for coding and knowledge work, with notable honesty improvements and dynamic workflows. However, you must carefully evaluate its suitability for tasks involving adversarial interactions, negotiations, or creative writing, as its tendency towards harshness and equivocation can be counterproductive. Consider adjusting effort levels and be wary of the "Max" setting, which may lead to verbose or unhelpful outputs.

Key insights

Comprehensive AI model evaluation requires diverse benchmarks, user feedback calibration, and understanding model "welfare" for accurate assessment.

Principles

Combine official and third-party benchmarks for robust evaluation.
Calibrate user reactions to filter noise and identify patterns.
Honesty improvements are critical for AI trustworthiness.

Method

Evaluate new models by integrating dozens of benchmarks, model card tests, welfare information, and calibrated user reactions to discern consistent capability patterns.

In practice

Utilize Claude Code's dynamic workflows for complex projects.
Adjust model effort levels to optimize performance for specific tasks.
Avoid using AI editors for creative writing to preserve distinct voice.

Topics

Claude Opus 4.8
Large Language Models
AI Benchmarking
Code Generation
Dynamic Workflows
Model Honesty
Prompt Engineering

Best for: NLP Engineer, AI Architect, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Don't Worry About the Vase.