GPT-5.5 tops benchmarks but still hallucinates frequently at a 20 percent higher API cost
Summary
OpenAI's GPT-5.5, released April 24, 2026, leads the Artificial Analysis Intelligence Index with 60 points, ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview, both at 57 points. Its API price doubled, but it uses 40 percent fewer tokens than GPT-5.4, so the net price increase is 20 percent. GPT-5.5 demonstrates strong price-performance, matching Claude Opus 4.7's scores at a quarter of the cost, but it has a significant hallucination problem. On the AA Omniscience benchmark it achieves 57 percent accuracy with an 86 percent hallucination rate, far higher than Claude Opus 4.7's 36 percent. On BullshitBench, GPT-5.5 pushes back against nonsensical questions only 45 percent of the time, similar to GPT-5.4, and GPT-5.5 Pro performs worse at 35 percent.
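The net price change follows from multiplying the two effects: a doubled per-token price applied to 40 percent fewer tokens. A minimal sketch of that arithmetic, using a hypothetical helper name and assuming cost scales linearly with token count:

```python
def net_cost_multiplier(price_multiplier: float, token_reduction: float) -> float:
    """Relative cost per task after a price change and a token-usage change.

    Assumes cost is simply price per token times tokens used.
    """
    return price_multiplier * (1.0 - token_reduction)


# Figures from the summary: price doubles, token usage drops 40 percent.
multiplier = net_cost_multiplier(price_multiplier=2.0, token_reduction=0.40)
net_change_percent = round((multiplier - 1.0) * 100)

print(f"cost multiplier: {multiplier:.2f}, net change: {net_change_percent:+d}%")
```

Real bills depend on the mix of input and output tokens, so the 20 percent figure is best read as an average rather than a guarantee for a given workload.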
Key takeaway
AI engineers evaluating new large language models for critical applications should prioritize models that push back robustly against nonsensical inputs, even if those models do not top general intelligence benchmarks. GPT-5.5 offers strong performance and token efficiency, but its high hallucination rate and weak BullshitBench results indicate a significant risk of factual inaccuracy. Consider models like Anthropic's Claude for tasks requiring higher reliability and less fabrication.
Key insights
Higher compute in LLMs does not automatically reduce hallucinations or make models better at rejecting nonsense.
Principles
- Benchmarks alone do not capture full model utility.
- Token efficiency can offset API price increases.
- Reasoning models may rationalize nonsense with more compute.
Method
BullshitBench presents models with plausible-sounding but illogical questions across five fields and scores each response as clear pushback, partial pushback, or acceptance of the nonsense.
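The three-way scoring described above could be tallied roughly as follows. This is only a sketch: the label names, the function, and the sample grades are hypothetical, and it assumes the reported pushback rate is the share of responses graded as clear pushback, which the summary does not spell out.

```python
from collections import Counter

# Hypothetical label names for the three outcomes described in the text.
LABELS = ("clear_pushback", "partial_pushback", "accepted_nonsense")


def pushback_summary(grades):
    """Return the share of each label among the graded responses."""
    if not grades:
        raise ValueError("no graded responses")
    unknown = set(grades) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(grades)
    total = len(grades)
    return {label: counts[label] / total for label in LABELS}


# Made-up grades for illustration only; these are not benchmark results.
example_grades = (
    ["clear_pushback"] * 4
    + ["partial_pushback"] * 3
    + ["accepted_nonsense"] * 3
)
summary = pushback_summary(example_grades)

for label, share in summary.items():
    print(f"{label}: {share:.0%}")
```

Keeping partial pushback as its own bucket makes it possible to report a strict rate and a lenient one from the same grades.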
In practice
- Evaluate LLMs on nonsense-detection benchmarks such as BullshitBench.
- Prioritize models with lower hallucination rates.
- Consider token efficiency for cost-effective API usage.
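When comparing hallucination rates, it helps to know what the rate is measured against. The summary reports 57 percent accuracy alongside an 86 percent hallucination rate, and those two figures cannot both be shares of all questions. One reading, assumed here and not stated in the summary, is that the hallucination rate counts wrong answers only among questions the model did not answer correctly, so that abstaining lowers it. A sketch under that assumption, with hypothetical counts chosen to be consistent with the reported percentages:

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of wrong answers among questions not answered correctly.

    Assumed definition: incorrect / (incorrect + abstained).
    """
    not_correct = incorrect + abstained
    if not_correct == 0:
        return 0.0
    return incorrect / not_correct


# Hypothetical counts out of 100 questions, for illustration only.
correct, incorrect, abstained = 57, 37, 6

accuracy = correct / (correct + incorrect + abstained)
rate = hallucination_rate(incorrect, abstained)

print(f"accuracy: {accuracy:.0%}, hallucination rate: {rate:.0%}")
```

Under this reading, a model can be both accurate and hallucination-prone: it knows a lot, but when it does not know, it usually guesses instead of declining.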
Topics
- GPT-5.5
- AI Benchmarks
- Hallucination Rate
- API Cost
- Token Efficiency
Best for: CTO, VP of Engineering/Data, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.