From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

Source: cs.LG updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Embedding-Perturbed Gradient Sensitivity (EPGS) is a new method for detecting "Stubborn Hallucinations" in Large Language Models (LLMs): factually incorrect predictions that the model makes with high confidence. Traditional hallucination detectors, which rely on predictive uncertainty or static internal representations, fail in these cases because the model's confidence is high even though the answer is wrong. EPGS rests on the hypothesis that robust facts sit in "flat minima" of the loss landscape, while stubborn hallucinations sit in "sharp minima" produced by brittle memorization. The method perturbs the input embeddings with Gaussian noise and measures the resulting spike in gradient magnitude, which serves as an efficient proxy for the Hessian spectrum. In experiments on Llama-2-7b, Llama-3-8b, and Mistral-7b-v0.1 across datasets such as TriviaQA, SQuAD, NQ, and SVAMP, EPGS consistently outperforms entropy-based and representation-based baselines, reaching up to 0.9732 AUROC on reasoning tasks and significantly higher AUROC on the stubborn-hallucination subsets.
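
Why a gradient spike under input noise can stand in for curvature can be sketched with a first-order Taylor expansion. This is a generic argument and not necessarily the paper's own derivation; the symbols g, J, e, δ, and σ are notation assumed here for illustration.

```latex
% g(e) = \nabla_\theta \mathcal{L}(\theta; e): loss gradient at input embedding e
% \delta \sim \mathcal{N}(0, \sigma^2 I): Gaussian perturbation of the embedding
\begin{align}
  g(e + \delta) &\approx g(e) + J\,\delta,
    \qquad J = \frac{\partial g}{\partial e}, \\
  \mathbb{E}_{\delta}\,\bigl\| g(e + \delta) - g(e) \bigr\|_2^2
    &\approx \mathbb{E}_{\delta}\,\bigl[ \delta^\top J^\top J\, \delta \bigr]
     = \sigma^2\, \| J \|_F^2 .
\end{align}
% If the gradient is taken with respect to e itself, J is the Hessian H and
% \| H \|_F^2 = \sum_i \lambda_i^2, the sum of squared Hessian eigenvalues.
```

Under these assumptions the expected squared gradient shift grows with the second-derivative magnitudes, so a sharp minimum produces a larger shift than a flat one at the same noise scale, without ever forming the Hessian explicitly.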

Key takeaway

For AI Engineers and Research Scientists deploying LLMs in critical domains, EPGS offers a way to identify high-confidence factual errors that evade traditional detection. Consider integrating it into model evaluation pipelines, especially for applications that require high factual integrity, to filter out brittle memorizations and improve model trustworthiness. The trade-off is cost: EPGS requires backward passes, which add computational overhead.

Key insights

EPGS detects confident LLM hallucinations by probing loss landscape curvature via input embedding perturbations.

Method

EPGS involves target entity masking, stochastic Gaussian noise injection into input embeddings, and measuring the magnitude and directional divergence of gradients in the last transformer block.
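
The measurement step can be illustrated on a toy problem. The sketch below is not the EPGS implementation: it replaces the LLM and its last transformer block with a hand-written quadratic loss, omits target entity masking, and uses hypothetical names (`quadratic_grad`, `gradient_sensitivity`) and arbitrary values for the noise scale and sample count. It only shows how Gaussian noise on the input, followed by a comparison of gradients, separates a flat loss surface from a sharp one.

```python
import math
import random


def quadratic_grad(x, curvatures):
    """Gradient of the toy loss L(x) = 0.5 * sum(c_i * x_i**2)."""
    return [c * xi for c, xi in zip(curvatures, x)]


def _norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def gradient_sensitivity(grad_fn, x, sigma=0.05, n_samples=64, seed=0):
    """Average gradient shift under Gaussian input noise (toy stand-in).

    Returns (magnitude, divergence):
      magnitude  = mean ||g(x + noise) - g(x)||
      divergence = mean (1 - cosine(g(x + noise), g(x)))
    """
    rng = random.Random(seed)
    base = grad_fn(x)
    base_norm = _norm(base)
    magnitude = 0.0
    divergence = 0.0
    for _ in range(n_samples):
        # Inject isotropic Gaussian noise into the input.
        noisy_x = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = grad_fn(noisy_x)
        # Magnitude of the gradient change caused by the noise.
        magnitude += _norm([g - b for g, b in zip(grad, base)])
        # Directional divergence; skipped when either gradient is zero.
        grad_norm = _norm(grad)
        if base_norm > 0.0 and grad_norm > 0.0:
            dot = sum(g * b for g, b in zip(grad, base))
            divergence += 1.0 - dot / (grad_norm * base_norm)
    return magnitude / n_samples, divergence / n_samples


# Same evaluation point, two loss surfaces: one flat in every direction,
# one with a single high-curvature direction (the "sharp" case).
x0 = [0.05, 1.0, 1.0, 1.0]
flat_mag, flat_div = gradient_sensitivity(
    lambda x: quadratic_grad(x, [0.1, 0.1, 0.1, 0.1]), x0)
sharp_mag, sharp_div = gradient_sensitivity(
    lambda x: quadratic_grad(x, [50.0, 0.1, 0.1, 0.1]), x0)

print(f"flat : magnitude={flat_mag:.4f}, divergence={flat_div:.4f}")
print(f"sharp: magnitude={sharp_mag:.4f}, divergence={sharp_div:.4f}")
```

Because both calls use the same seed, they see identical noise draws, so the difference in scores comes only from the curvature of the loss. In a real model the gradient would come from a backward pass through the network rather than a closed-form expression.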

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.