What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

2026-04-24 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, long

Summary

An analysis of large language model (LLM) performance challenges the prevailing narrative of continuous, across-the-board improvement often depicted by benchmark charts. BullshitBench evaluates models' ability to identify and push back against nonsense questions, and it shows that many widely used models, including GPT and Gemini, push back only about 50% of the time. Anthropic's Claude 4.5 Sonnet and Haiku models perform better on this test. The talk also analyzes Arena's dataset of more than 5.5 million user votes collected since 2023 and introduces a "dissatisfaction rate" metric. For the top 25 models this rate has fallen from 17% to 9% over time, which indicates progress, but it also means that nearly one in ten interactions with leading models still leaves the user dissatisfied. Dissatisfaction is particularly pronounced in complex domains such as gaming, finance, and law, suggesting a significant gap between narrow benchmark performance and real-world utility.

Key takeaway

AI Product Managers evaluating LLM integration should look beyond traditional benchmarks and consider real-world dissatisfaction rates. Shift the focus from optimizing solely for narrow, well-defined tasks to improving model performance across a broader distribution of complex, ambiguous user queries, especially in domains such as gaming, finance, and law, where current models frequently fall short.

Key insights

LLMs still struggle with identifying nonsense and consistently meeting user expectations in complex, real-world tasks.

Principles

Benchmarks often fail to capture real-world model utility.
Model reasoning capabilities do not always improve performance.
User expectations evolve, influencing perceived model performance.

Method

BullshitBench assesses LLMs by asking them nonsense questions and grading whether they push back. Arena's methodology uses user-submitted prompts and preference-based voting to derive a "dissatisfaction rate" for each model.
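
A minimal sketch of how a per-model dissatisfaction rate could be derived from preference votes. The summary does not give the exact definition, so this assumes each vote records the two models shown plus one of four outcomes, and that a "both_bad" outcome is the dissatisfaction signal. The field names and outcome labels are hypothetical, not Arena's actual schema.

```python
from collections import defaultdict


def dissatisfaction_rate(votes):
    """Return {model: share of that model's comparisons voted 'both_bad'}.

    Assumption: 'both_bad' is the only outcome counted as dissatisfaction,
    and it counts against both models shown in that comparison.
    """
    shown = defaultdict(int)  # comparisons each model appeared in
    bad = defaultdict(int)    # comparisons where the user rejected both answers
    for vote in votes:
        for model in (vote["model_a"], vote["model_b"]):
            shown[model] += 1
            if vote["outcome"] == "both_bad":
                bad[model] += 1
    return {model: bad[model] / shown[model] for model in shown}


# Toy votes with made-up model names, for illustration only.
votes = [
    {"model_a": "m1", "model_b": "m2", "outcome": "a_better"},
    {"model_a": "m1", "model_b": "m3", "outcome": "both_bad"},
    {"model_a": "m2", "model_b": "m3", "outcome": "tie"},
    {"model_a": "m1", "model_b": "m2", "outcome": "both_bad"},
]
rates = dissatisfaction_rate(votes)
```

Under this assumed definition, a rate of 0.09 for a model would correspond to the "nearly one in ten" figure quoted in the summary.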

In practice

Test LLMs with nonsensical queries to gauge robustness.
Monitor user dissatisfaction rates for real-world performance insights.
Recognize that "line goes up" benchmarks may not reflect all use cases.
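
The first item above could be tried with a small harness like the following. This is a sketch under stated assumptions, not BullshitBench's actual grader: the marker phrases, the example prompts, and the stub model are all hypothetical, and a real evaluation would more likely use a rubric or a judge model than keyword matching.

```python
from typing import Callable, Iterable

# Hypothetical phrases that suggest the model challenged the premise.
PUSHBACK_MARKERS = (
    "does not make sense",
    "doesn't make sense",
    "not a real",
    "no such",
    "false premise",
)


def pushes_back(response: str) -> bool:
    """Crude heuristic: does the response contain any pushback marker?"""
    text = response.lower()
    return any(marker in text for marker in PUSHBACK_MARKERS)


def pushback_rate(ask: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of nonsense prompts on which the model pushes back."""
    results = [pushes_back(ask(prompt)) for prompt in prompts]
    if not results:
        raise ValueError("need at least one prompt")
    return sum(results) / len(results)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own client."""
    if "thermal" in prompt:
        return "This question does not make sense: weekdays have no thermal conductivity."
    return "Sure! Here is a step-by-step plan."


# Made-up nonsense prompts, for illustration only.
NONSENSE_PROMPTS = [
    "What is the thermal conductivity of a Tuesday?",
    "How do I refactor my spreadsheet's emotional latency?",
]
rate = pushback_rate(stub_model, NONSENSE_PROMPTS)
```

Passing the model as a plain callable keeps the harness independent of any particular provider SDK, so the same prompt set can be replayed against several models and the rates compared.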

Topics

LLM Benchmarking
Model Limitations
BullshitBench
Arena.ai Data
Dissatisfaction Rate

Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.