Sonnet 5 review: I ran 64 generations to find out if it's worth it
Summary
The "How I AI Bench" evaluated Claude Sonnet 5 and four other frontier models (Opus 4.8, GPT 5.5, Sonnet 4.6, and Gemini 3 Pro) across 64 blind prototype generations, PRDs, and agent voice tests. Built with Claude Code, the benchmark aimed to move beyond "vibe checks" to repeatable, rubric-graded assessments. Anthropic positions Sonnet 5 as its most agentic Sonnet model, offering Opus-level performance at Sonnet-level prices: $2 per million input tokens and $10 per million output tokens until summer's end. Initial automated LLM judging placed Gemini 3 Pro and Sonnet 5 highest, but the human "vibe check" diverged significantly, favoring Sonnet 4.6. This discrepancy highlighted the LLM judges' tendency toward middle-of-the-road scoring and their lack of "taste," while the human evaluation focused on visual quality over functional code. A final index weighted 70% toward human scores ranked Sonnet 4.6 and Gemini 3 Pro highest, with Sonnet 5 and Opus 4.8 at the bottom of the author's preference.
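One way to picture the final index is as a weighted average of a human score and an LLM-judge score per model. The sketch below is only an illustration: the summary gives the 70% human weight, but the assumption that the remaining 30% comes from the automated judge, the shared score scale, and the names `blended_index` and `rank` are not from the original article.

```python
from typing import Dict, List

# From the summary: the final index was 70% human-weighted.
# Assumption: the remaining 30% is the automated LLM-judge score.
HUMAN_WEIGHT = 0.7


def blended_index(human: Dict[str, float], judge: Dict[str, float],
                  human_weight: float = HUMAN_WEIGHT) -> Dict[str, float]:
    """Blend per-model human and judge scores (same scale) into one index."""
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    if human.keys() != judge.keys():
        raise ValueError("both score tables must cover the same models")
    return {
        model: human_weight * human[model] + (1.0 - human_weight) * judge[model]
        for model in human
    }


def rank(index: Dict[str, float]) -> List[str]:
    """Model names sorted from highest to lowest index value."""
    return sorted(index, key=index.get, reverse=True)


if __name__ == "__main__":
    # Placeholder scores for hypothetical models, not results from the article.
    human_scores = {"model_a": 9.0, "model_b": 5.0}
    judge_scores = {"model_a": 6.0, "model_b": 8.0}
    print(rank(blended_index(human_scores, judge_scores)))
```

With a weight this high, a model the judge prefers can still rank below one the human prefers, which matches the kind of reordering the summary describes.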
Key takeaway
If you are an AI engineer or ML director evaluating new frontier models, integrate human "vibe checks" with automated metrics in your selection process. While LLM-based judges offer speed, they often lack the nuanced "taste" required for high-quality outputs like UI prototypes or agentic voice. Prioritize models based on task-specific strengths: use GPT 5.5 for PRDs, Sonnet 4.6 for simpler prototypes and conversational agents, and Opus 4.8 for complex UIs or codebases. Develop hybrid evaluation strategies to align model performance with your team's qualitative standards.
Key insights
Benchmarking frontier LLMs requires combining automated metrics with human "taste" to capture nuanced performance.
Principles
- Automated LLM judges often lack "taste" and rate to the middle.
- Model performance varies significantly by specific task.
- Blind scoring and rubrics improve benchmark objectivity.
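The first principle can be checked on your own data by comparing how widely each rater uses the scale. This is a rough sketch, not the article's method: the function names, the 0.5 ratio threshold, and the choice of population standard deviation as the spread measure are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def spread_report(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize how widely a rater uses the scoring scale."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        "range": max(scores) - min(scores),
    }


def rates_to_the_middle(judge_scores: Sequence[float],
                        human_scores: Sequence[float],
                        ratio: float = 0.5) -> bool:
    """True when the judge's spread is below `ratio` times the human spread.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    human_sd = pstdev(human_scores)
    if human_sd == 0:
        return False  # no human spread to compare against
    return pstdev(judge_scores) < ratio * human_sd


if __name__ == "__main__":
    # Placeholder scores for the same four outputs, not data from the article.
    judge = [6, 7, 6, 7]
    human = [2, 9, 4, 10]
    print(spread_report(judge), spread_report(human))
    print(rates_to_the_middle(judge, human))
```

A judge whose scores cluster tightly while human scores range widely is a hint that the automated score is not separating good outputs from bad ones.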
Method
Build custom evaluation benchmarks using LLM code generation (e.g., Claude Code) to create tasks, generate outputs, and structure human "vibe checks" alongside automated LLM scoring.
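The blind part of such a benchmark can be sketched in a few lines: shuffle the outputs, show the rater anonymous labels only, and map scores back to model names afterwards. This is a minimal sketch under assumptions; the article does not describe its harness, and `blind_batch`, `unblind`, and the "Sample N" label format are hypothetical names chosen here.

```python
import random
from typing import Dict, List, Tuple


def blind_batch(outputs: Dict[str, str], seed: int = 0
                ) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Shuffle model outputs and hide model names behind anonymous labels.

    Returns the list of (label, output) pairs to show a rater, plus the
    key mapping each label back to its model, to be kept out of sight.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    models = list(outputs)
    rng.shuffle(models)
    labels = [f"Sample {i + 1}" for i in range(len(models))]
    key = dict(zip(labels, models))
    batch = [(label, outputs[model]) for label, model in key.items()]
    return batch, key


def unblind(scores: Dict[str, float], key: Dict[str, str]) -> Dict[str, float]:
    """Map scores keyed by anonymous label back to model names."""
    return {key[label]: score for label, score in scores.items()}


if __name__ == "__main__":
    # Placeholder outputs for hypothetical models.
    outputs = {"model_a": "<html>A</html>", "model_b": "<html>B</html>"}
    batch, key = blind_batch(outputs, seed=42)
    for label, text in batch:
        print(label, text)  # the rater sees only the label and the output
    print(unblind({label: 5.0 for label, _ in batch}, key))
```

The same batch can be sent to both a human rater and an LLM judge, so the two sets of scores stay comparable.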
In practice
- Integrate human "vibe checks" into LLM evaluation workflows.
- Use GPT 5.5 for comprehensive PRD generation.
- Employ Sonnet 4.6 for prototyping simpler designs or agentic chit-chat.
Topics
- Claude Sonnet 5
- LLM Benchmarking
- AI Agentic Capabilities
- UI Prototyping
- Human-in-the-Loop Evaluation
- Generative AI Pricing
Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Lenny's Newsletter.