The Bank Said GPT-5.5 Hallucinates Less. The Benchmark Said 86%. Here’s Why They’re Both Right.

· Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

The release of GPT-5.5 drew conflicting public statements about its hallucination rate. Leigh-Ann Russell, CIO of the Bank of New York, praised the model's "impressive hallucination resistance," calling it a significant improvement for scaling the bank's 220-plus AI use cases. Six hours later, Artificial Analysis's independent AA-Omniscience benchmark reported a hallucination rate of 86 percent for GPT-5.5, the highest among the frontier models tested. For comparison, Claude Opus 4.7 registered 36 percent and Gemini 3.1 Pro Preview 50 percent. The gap between the two claims shows how far a customer's perception of a model can sit from what a benchmark measures.

Key takeaway

CTOs and VPs of Engineering evaluating new large language models for enterprise deployment should check vendor and customer claims against independent, third-party benchmarks. Relying solely on anecdotal praise, even from prominent customers, risks adopting a model with an unacceptable hallucination rate and undermining critical AI initiatives. Base procurement decisions and deployment risk management on objective performance data.

Key insights

An enterprise's perception of an AI model's performance can diverge significantly from independent benchmark results.

Best for: CTO, VP of Engineering/Data, Executive, Director of AI/ML, AI Product Manager, Consultant

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.