Sonnet 5 review: I ran 64 generations to find out if it's worth it

2026-06-29 · Source: Lenny's Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

The "How I AI Bench" evaluated Claude Sonnet 5 against four other frontier models (Opus 4.8, GPT 5.5, Sonnet 4.6, and Gemini 3 Pro) across 64 blind prototype generations, PRDs, and agent voice tests. Built with Claude Code, the benchmark aimed to move beyond "vibe checks" to repeatable, Clairvaux-graded assessments. Anthropic positions Sonnet 5 as its most agentic Sonnet model, offering Opus-level performance at Sonnet-level prices: $2 per million input tokens and $10 per million output tokens until summer's end. Initial automated LLM judging placed Gemini 3 Pro and Sonnet 5 highest, but the human "vibe check" diverged significantly, favoring Sonnet 4.6. The discrepancy highlighted two weaknesses of LLM judges, a tendency to score toward the middle and a lack of "taste," whereas the human evaluation weighed visual quality over functional code. A final index weighted 70% toward the human scores ranked Sonnet 4.6 and Gemini 3 Pro highest and placed Sonnet 5 and Opus 4.8 at the bottom of the author's preferences.
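To make the 70% human-weighted index concrete, here is a minimal sketch of how such a blend could be computed. It assumes a plain linear weighted average in which the remaining 30% goes to the LLM-judge score, which the summary does not confirm. The function name `blended_index`, the model names, and every score are hypothetical placeholders, not figures from the article.

```python
def blended_index(human: float, judge: float, human_weight: float = 0.7) -> float:
    """Weighted average of a human score and an LLM-judge score on the same scale.

    Assumption: a simple linear blend; the article's exact formula is not given.
    """
    if not 0.0 <= human_weight <= 1.0:
        raise ValueError("human_weight must be between 0 and 1")
    return human_weight * human + (1.0 - human_weight) * judge


# Hypothetical placeholder scores on a 0-10 scale, NOT results from the article.
scores = {
    "model_a": {"human": 8.0, "judge": 6.0},
    "model_b": {"human": 5.0, "judge": 9.0},
}

# Rank models from highest to lowest blended score.
ranking = sorted(
    scores,
    key=lambda name: blended_index(scores[name]["human"], scores[name]["judge"]),
    reverse=True,
)

for name in ranking:
    s = scores[name]
    print(name, round(blended_index(s["human"], s["judge"]), 2))
# model_a scores about 7.4 and model_b about 6.2, so model_a ranks first
# even though the judge alone preferred model_b.
```

With a 0.7 human weight, a model the judge prefers can still rank below one the human prefers, which is the kind of reversal the summary describes.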

Key takeaway

If you are an AI engineer or ML director evaluating new frontier models, combine human "vibe checks" with automated metrics in your selection process. LLM-based judges offer speed, but they often lack the nuanced "taste" required to assess outputs such as UI prototypes or agent voice. Choose models by task-specific strength: GPT 5.5 for PRDs, Sonnet 4.6 for simpler prototypes and conversational agents, and Opus 4.8 for complex UIs or codebases. Develop hybrid evaluation strategies that align model selection with your team's qualitative standards.

Key insights

Benchmarking frontier LLMs requires combining automated metrics with human "taste" to capture nuanced performance.

Principles

Automated LLM judges often lack "taste" and rate to the middle.
Model performance varies significantly by specific task.
Blind scoring and rubrics improve benchmark objectivity.

Method

Build custom evaluation benchmarks using LLM code generation (e.g., Claude Code) to create tasks, generate outputs, and structure human "vibe checks" alongside automated LLM scoring.
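One piece of this method, presenting outputs blind for the human "vibe check", can be sketched as follows. This is an illustrative assumption about how blinding might work, not the article's actual harness: the function `blind_outputs`, the "Sample N" labels, and the example model names are all hypothetical.

```python
import random
from typing import Optional


def blind_outputs(
    outputs: dict[str, str], seed: Optional[int] = None
) -> tuple[dict[str, str], dict[str, str]]:
    """Hide model identities behind anonymous labels.

    Returns (blinded, key):
      blinded maps an anonymous label to the generated output,
      key maps the same label back to the model name for unblinding later.
    """
    rng = random.Random(seed)  # a seed makes the shuffle reproducible
    models = list(outputs)
    rng.shuffle(models)  # random order so label position reveals nothing
    labels = [f"Sample {i + 1}" for i in range(len(models))]
    key = dict(zip(labels, models))
    blinded = {label: outputs[model] for label, model in key.items()}
    return blinded, key


# Hypothetical outputs; in practice these would be the generated prototypes.
outputs = {
    "model_a": "<html>prototype A</html>",
    "model_b": "<html>prototype B</html>",
    "model_c": "<html>prototype C</html>",
}
blinded, key = blind_outputs(outputs, seed=42)

# The reviewer sees only `blinded`; `key` stays hidden until scoring is done.
for label in blinded:
    print(label)
```

Keeping the label-to-model key separate from what the reviewer sees is what lets the human scores be joined back to model names only after scoring is complete.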

In practice

Integrate human "vibe checks" into LLM evaluation workflows.
Use GPT 5.5 for comprehensive PRD generation.
Employ Sonnet 4.6 for prototyping simpler designs or agentic chit-chat.

Topics

Claude Sonnet 5
LLM Benchmarking
AI Agentic Capabilities
UI Prototyping
Human-in-the-Loop Evaluation
Generative AI Pricing

Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Lenny's Newsletter.