GPT-5.5 tops benchmarks but still hallucinates frequently at a 20 percent higher API cost

2026-04-25 · Source: The Decoder · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, short

Summary

OpenAI's GPT-5.5, released April 24, 2026, leads the Artificial Analysis Intelligence Index with 60 points, ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview at 57 points each. Its API price doubled, but it uses 40 percent fewer tokens than GPT-5.4, so the net price increase works out to about 20 percent (2 × 0.6 = 1.2). GPT-5.5 shows strong price-performance, matching Claude Opus 4.7's scores at a quarter of the cost, but it has a significant hallucination problem. On the AA Omniscience benchmark it reaches 57 percent accuracy with an 86 percent hallucination rate, far above Claude Opus 4.7's 36 percent. The two figures are compatible because the hallucination rate is measured only over the questions a model does not answer correctly: it is the share of those where the model gives a wrong answer instead of declining to answer. On BullshitBench, GPT-5.5 pushes back on only 45 percent of nonsensical questions, similar to GPT-5.4, and GPT-5.5 Pro does worse at 35 percent.
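The cost arithmetic in the summary can be checked with a minimal sketch. It assumes cost scales linearly with price and with tokens consumed, and that the workload is otherwise identical. The function names are hypothetical and exist only for illustration.

```python
def net_cost_multiplier(price_multiplier: float, token_reduction: float) -> float:
    """Relative cost of the same workload after a price change and a token-usage change.

    Assumes cost = price * tokens, so the two effects simply multiply.
    """
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return price_multiplier * (1.0 - token_reduction)


def break_even_token_reduction(price_multiplier: float) -> float:
    """Token reduction needed for the net cost to stay flat after a price change."""
    if price_multiplier <= 0.0:
        raise ValueError("price_multiplier must be positive")
    return 1.0 - 1.0 / price_multiplier


# Figures from the summary: price doubles, token usage drops 40 percent.
multiplier = net_cost_multiplier(price_multiplier=2.0, token_reduction=0.40)
print(f"Net cost change: {(multiplier - 1.0) * 100:+.0f}%")

# A doubled price is fully offset only by halving token usage.
print(f"Break-even token reduction: {break_even_token_reduction(2.0):.0%}")
```

Under these assumptions, a 40 percent token saving offsets most of a doubled price but not all of it; only a 50 percent saving would leave the bill unchanged.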

Key takeaway

AI engineers evaluating new large language models for critical applications should prioritize models that push back reliably on nonsensical inputs, even if those models don't top general intelligence benchmarks. GPT-5.5 offers strong performance and token efficiency, but its high hallucination rate and weak BullshitBench results signal a real risk of factual inaccuracy. For tasks that demand higher reliability and less fabrication, consider models such as Anthropic's Claude.

Key insights

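Spending more compute on an LLM does not automatically reduce hallucinations or make the model better at rejecting nonsensical premises: GPT-5.5 Pro pushes back less often (35 percent) than GPT-5.5 (45 percent).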

Principles

Benchmarks alone do not capture full model utility.
Token efficiency can offset API price increases.
Reasoning models may rationalize nonsense with more compute.

Method

BullshitBench presents models with plausible-sounding but illogical questions across five fields and scores each response as clear pushback, partial pushback, or acceptance of the nonsense.
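A three-way scoring scheme like this can be tallied with a short sketch. This is not BullshitBench's actual code: the `Verdict` labels, the `summarize` helper, and the sample verdicts are all hypothetical, and the sketch assumes a separate grader has already labeled each response. It also assumes the headline pushback rate counts only clear pushback, which the source does not confirm.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    # Hypothetical labels mirroring the three outcomes described above.
    CLEAR_PUSHBACK = "clear_pushback"
    PARTIAL_PUSHBACK = "partial_pushback"
    ACCEPTED = "accepted"


def summarize(verdicts: list[Verdict]) -> dict[Verdict, float]:
    """Return the share of responses that received each verdict."""
    if not verdicts:
        raise ValueError("need at least one graded response")
    counts = Counter(verdicts)
    return {verdict: counts[verdict] / len(verdicts) for verdict in Verdict}


# Made-up grading results for 20 nonsensical questions (illustration only).
graded = (
    [Verdict.CLEAR_PUSHBACK] * 9
    + [Verdict.PARTIAL_PUSHBACK] * 4
    + [Verdict.ACCEPTED] * 7
)

shares = summarize(graded)
for verdict, share in shares.items():
    print(f"{verdict.value:>16}: {share:.0%}")
```

Keeping partial pushback as its own bucket matters: a model that hedges but still answers the nonsensical question looks very different depending on whether that bucket is counted as a pass or a fail.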

In practice

Evaluate LLMs with "bullshit" benchmarks.
Prioritize models with lower hallucination rates.
Consider token efficiency for cost-effective API usage.
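When comparing hallucination rates, it helps to see what the reported numbers imply about a model's answers. The sketch below assumes the hallucination rate is the share of not-correct responses where the model gave a wrong answer instead of abstaining, and it ignores partially correct answers for simplicity. Both points are assumptions, and `implied_breakdown` is a hypothetical helper.

```python
def implied_breakdown(accuracy: float, hallucination_rate: float) -> dict[str, float]:
    """Split all questions into correct, wrong, and abstained shares.

    Assumes hallucination_rate = wrong / (wrong + abstained), i.e. it is
    measured only over the questions the model did not answer correctly.
    """
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= hallucination_rate <= 1.0):
        raise ValueError("both inputs must be fractions in [0, 1]")
    not_correct = 1.0 - accuracy
    wrong = not_correct * hallucination_rate
    abstained = not_correct - wrong
    return {"correct": accuracy, "wrong": wrong, "abstained": abstained}


# GPT-5.5 figures from the summary: 57 percent accuracy, 86 percent hallucination rate.
gpt55 = implied_breakdown(accuracy=0.57, hallucination_rate=0.86)
for outcome, share in gpt55.items():
    print(f"{outcome:>9}: {share:.1%}")
```

Read this way, the figures suggest that when GPT-5.5 does not know an answer it guesses far more often than it abstains, which is the behavior to weigh against its high accuracy.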

Topics

GPT-5.5
AI Benchmarks
Hallucination Rate
API Cost
Token Efficiency

Best for: CTO, VP of Engineering/Data, AI Engineer, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by The Decoder.