The Smallest Model Won One of My Tests, and Other Things Benchmarks Won’t Tell You
Summary
In June 2026 the author ran a custom benchmark comparing four Claude models (Haiku 4.5, Sonnet 4.6, Opus 4.8, and Fable 5) on four real-world tasks with hidden traps, rather than on standard datasets. The tasks were an underspecified scheduling problem, a debugging scenario, a credit card statement to be checked for subscriptions and double charges, and a rewrite of corporate jargon under strict constraints. Unexpectedly, Haiku, the smallest model, excelled at the jargon rewrite by following the rules strictly and not fabricating facts. All four models found the anomalies in the credit card task, but only Opus and Fable calibrated their confidence correctly. The larger models handled dirty data better by stating their assumptions explicitly, and they showed greater awareness of consequences in the debugging task. The author also discovered by accident that account memory had contaminated the experiment, so controlled comparisons should go through the API or run with memory turned off.
Key takeaway
For AI engineers and ML directors evaluating new models, benchmark scores alone are not enough. Build custom, real-world test suites with domain-specific challenges and "dirty data" to assess model obedience, confidence calibration, and consequence awareness. This approach shows how models handle ambiguity and critical details, which helps you judge whether they are safe and effective for production tasks rather than just "smarter" on abstract metrics.
Key insights
Real-world AI performance diverges from benchmark scores, and the gap shows up as differences in obedience, calibration, and consequence awareness.
Principles
- Model capability, calibration, and obedience are distinct.
- Larger models may invent facts or embellish.
- Miscalibrated confidence is worse than wrong answers.
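One way to make the third principle concrete is a scoring rule that charges more for a confident error than for a hedged one. The sketch below uses the Brier score, a standard metric chosen here as an assumption; it is not the rubric from the original article, and the confidence values are made up for illustration.

```python
def brier(confidence: float, correct: bool) -> float:
    """Squared gap between stated confidence (0..1) and the outcome.

    The outcome is 1.0 for a correct answer and 0.0 for a wrong one.
    Lower scores are better.
    """
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2


# Hypothetical answers: two wrong, one right, with different stated confidence.
confident_wrong = brier(0.95, correct=False)
hedged_wrong = brier(0.55, correct=False)
hedged_right = brier(0.60, correct=True)

# The confident mistake is penalized most, the hedged correct answer least.
print(confident_wrong > hedged_wrong > hedged_right)
```

Under this rule, two models with the same number of wrong answers can still score differently, which keeps calibration separate from raw capability.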
Method
The testing method uses four real-world tasks with hidden traps, identical prompts for every model, and a scoring rubric written in advance. Each task runs in a fresh conversation, ideally with memory off.
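The method above could be organized as a small harness like the following. This is a minimal sketch, not the author's setup: `Task`, `run_suite`, the stub models, and the rubric checks are all hypothetical names invented for illustration, and the stubs stand in for real, stateless API calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    name: str
    prompt: str  # identical text sent to every model
    rubric: Dict[str, Callable[[str], bool]]  # checks written before any run


def run_suite(
    models: Dict[str, Callable[[str], str]], tasks: List[Task]
) -> Dict[str, Dict[str, float]]:
    """Return {model: {task: fraction of rubric checks passed}}."""
    results: Dict[str, Dict[str, float]] = {}
    for model_name, ask in models.items():
        results[model_name] = {}
        for task in tasks:
            # One independent call per (model, task): no shared history.
            answer = ask(task.prompt)
            passed = sum(1 for check in task.rubric.values() if check(answer))
            results[model_name][task.name] = passed / len(task.rubric)
    return results


# Hypothetical task and constraints, loosely modeled on the jargon rewrite.
jargon = Task(
    name="jargon_rewrite",
    prompt=(
        "Rewrite in plain English, under 15 words, adding no new facts: "
        "'We will leverage synergies going forward.'"
    ),
    rubric={
        "under_15_words": lambda a: len(a.split()) < 15,
        "no_invented_numbers": lambda a: not any(ch.isdigit() for ch in a),
    },
)


# Stub "models" that return canned answers; replace with real API calls.
def terse_stub(prompt: str) -> str:
    return "We will work together more closely."


def embellishing_stub(prompt: str) -> str:
    return (
        "We will work together more closely, saving 20% in costs "
        "by next quarter, which is great news."
    )


scores = run_suite(
    {"terse": terse_stub, "embellishing": embellishing_stub}, [jargon]
)
print(scores)
```

Keeping the rubric as data defined before any model runs makes it harder to adjust the scoring after seeing the answers, and passing each model in as a plain function keeps every call free of account-level state.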
In practice
- Develop personal, domain-specific AI test files.
- Disable AI memory for controlled comparisons.
- Evaluate models for consequence awareness.
Topics
- AI Benchmarking
- Large Language Models
- Model Evaluation
- Claude Models
- AI Confidence Calibration
- Data Contamination
- Real-world AI Testing
Best for: AI Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.