Sonnet 5 review: I ran 64 generations to find out if it's worth it

· Source: Lenny's Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

The "How I AI Bench" evaluated Claude Sonnet 5 and four other frontier models (Opus 4.8, GPT 5.5, Sonnet 4.6, and Gemini 3 Pro) across 64 blind prototype generations, PRDs, and agent voice tests. Built with Claude Code, the benchmark aimed to move beyond "vibe checks" to repeatable, Clairvaux-graded assessments. Anthropic positions Sonnet 5 as its most agentic Sonnet model, offering Opus-level performance at Sonnet-level prices: $2 per million input tokens and $10 per million output tokens until summer's end. Initial automated LLM judging placed Gemini 3 Pro and Sonnet 5 highest, but the human "vibe check" diverged significantly and favored Sonnet 4.6. The discrepancy highlighted two weaknesses of LLM judges, a tendency toward middle-of-the-road scoring and a lack of "taste," while the human evaluation weighted visual quality over functional code. A final index weighted 70% toward the human scores ranked Sonnet 4.6 and Gemini 3 Pro highest, with Sonnet 5 and Opus 4.8 at the bottom of the author's preference.
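The 70% human-weighted index can be read as a simple weighted average of the two score sources. The sketch below is one plausible form of it, not the author's actual formula: it assumes human and LLM-judge scores sit on the same scale, and the function names, model labels, and numbers are all hypothetical placeholders.

```python
def blended_index(human_score, judge_score, human_weight=0.7):
    """Weighted average of a human score and an LLM-judge score.

    Assumes both scores use the same scale (for example 1-10).
    """
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    return human_weight * human_score + (1.0 - human_weight) * judge_score


def rank_models(scores, human_weight=0.7):
    """Return model names sorted from highest to lowest blended index.

    `scores` maps model name -> (human_score, judge_score).
    """
    blended = {
        name: blended_index(human, judge, human_weight)
        for name, (human, judge) in scores.items()
    }
    return sorted(blended, key=blended.get, reverse=True)


# Placeholder numbers for illustration only; they are not results from the article.
example_scores = {
    "model_a": (9.0, 6.0),  # strong human rating, middling judge rating
    "model_b": (5.0, 8.0),  # weak human rating, strong judge rating
}
print(rank_models(example_scores))
```

With a 0.7 human weight the human-preferred placeholder model ranks first; setting `human_weight=0.0` reproduces a judge-only ranking, which is how the two leaderboards in the summary can disagree.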

Key takeaway

If you are an AI engineer or ML director evaluating new frontier models, integrate human "vibe checks" with automated metrics in your selection process. LLM-based judges offer speed, but they often lack the nuanced "taste" required to assess high-quality outputs such as UI prototypes or agentic voice. Prioritize models by task-specific strengths: use GPT 5.5 for PRDs, Sonnet 4.6 for simpler prototypes and conversational agents, and Opus 4.8 for complex UIs or codebases. Develop hybrid evaluation strategies that align model performance with your team's qualitative standards.

Key insights

Benchmarking frontier LLMs requires combining automated metrics with human "taste" to capture nuanced performance.

Method

Build custom evaluation benchmarks using LLM code generation (e.g., Claude Code) to create tasks, generate outputs, and structure human "vibe checks" alongside automated LLM scoring.
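One way to structure the human "vibe check" so that it stays blind is to shuffle each task's outputs and hide the model names behind neutral labels until the ratings are in. The sketch below illustrates that idea only. The summary does not describe how the original benchmark anonymized its generations, so `make_blind_batch`, `unblind`, and the label format are hypothetical.

```python
import random


def make_blind_batch(outputs, seed=0):
    """Shuffle model outputs and hide model names behind neutral labels.

    `outputs` maps model name -> generated artifact (e.g. prototype HTML).
    Returns (batch, key): `batch` is a list of (label, artifact) pairs shown
    to the reviewer; `key` maps label -> model name and stays hidden until
    scoring is finished.
    """
    names = sorted(outputs)
    random.Random(seed).shuffle(names)  # seeded so a run can be reproduced
    batch, key = [], {}
    for i, name in enumerate(names):
        label = f"candidate_{i + 1}"
        batch.append((label, outputs[name]))
        key[label] = name
    return batch, key


def unblind(ratings, key):
    """Map reviewer ratings keyed by blind label back to model names."""
    return {key[label]: rating for label, rating in ratings.items()}
```

A reviewer would rate `candidate_1`, `candidate_2`, and so on without seeing which model produced each artifact, and `unblind` would then attach the ratings to model names for comparison with the automated LLM scores.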

Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Lenny's Newsletter.