From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
Summary
Embedding-Perturbed Gradient Sensitivity (EPGS) is a new method for detecting "Stubborn Hallucinations" in Large Language Models (LLMs): factually incorrect predictions made with high confidence. Traditional hallucination detectors rely on predictive uncertainty or static internal representations, and they fail in these cases because the model is confidently wrong. EPGS rests on the hypothesis that robust facts reside in "flat minima" of the loss landscape, while stubborn hallucinations occupy "sharp minima" produced by brittle memorization. The method perturbs input embeddings with Gaussian noise and measures the resulting spike in gradient magnitude, which serves as an efficient proxy for the Hessian spectrum. In experiments on Llama-2-7b, Llama-3-8b, and Mistral-7b-v0.1 across TriviaQA, SQuAD, NQ, and SVAMP, EPGS consistently outperforms entropy-based and representation-based baselines, reaching up to 0.9732 AUROC on reasoning tasks and significantly higher AUROC on stubborn hallucination subsets.
Key takeaway
For AI Engineers and Research Scientists deploying LLMs in critical domains, EPGS offers a robust mechanism for identifying high-confidence factual errors that evade traditional detection. Consider integrating EPGS into your model evaluation pipelines, especially for applications that require high factual integrity, to filter out brittle memorizations and improve model trustworthiness. The trade-off is the computational overhead of the backward passes it requires.
Key insights
EPGS detects confident LLM hallucinations by probing loss landscape curvature via input embedding perturbations.
Principles
- Robust facts reside in flat minima.
- Stubborn hallucinations occupy sharp minima.
- Input sensitivity approximates Hessian sharpness.
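The third principle can be motivated with a first-order Taylor expansion. The sketch below is a simplification, not the paper's derivation: it assumes the perturbation and the gradient live in the same space, whereas EPGS perturbs input embeddings and reads gradients from the last transformer block. The symbols e (the point being probed), g (the loss gradient), H (the Hessian with eigenvalues λ_i), and σ (the noise scale) are notation introduced here for illustration.

```latex
% Assumed setup: loss L, gradient g(e) = \nabla L(e), symmetric Hessian H = \nabla^2 L(e),
% and Gaussian noise \delta \sim \mathcal{N}(0, \sigma^2 I).
\begin{align}
  g(e + \delta) &\approx g(e) + H\,\delta \\
  \mathbb{E}\,\bigl\| g(e + \delta) - g(e) \bigr\|^2
    &\approx \mathbb{E}\bigl[ \delta^\top H^2 \delta \bigr]
     = \sigma^2 \operatorname{tr}(H^2)
     = \sigma^2 \sum_i \lambda_i^2
\end{align}
```

Under these assumptions, the expected change in the gradient grows with the squared Hessian eigenvalues, so a sharp minimum (large eigenvalues) produces a larger gradient spike than a flat one for the same noise scale.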
Method
EPGS involves target entity masking, stochastic Gaussian noise injection into input embeddings, and measuring the magnitude and directional divergence of gradients in the last transformer block.
In practice
- Focus gradient analysis on key entities for accuracy.
- Use Gaussian noise for embedding perturbation.
- Extract gradients from the last transformer block.
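The two signals named in the method, gradient magnitude and directional divergence, can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it replaces the LLM with a hypothetical quadratic loss whose curvature is set by hand, and every name in it (`toy_gradient`, `toy_sensitivity`, `curvature`, `residual`, `sigma`) is an assumption made for illustration. Entity masking and the choice of transformer block are not modeled.

```python
import math
import random


def l2_norm(v):
    """Euclidean norm of a plain Python list."""
    return math.sqrt(sum(a * a for a in v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = l2_norm(u), l2_norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def toy_gradient(x, curvature, residual):
    """Gradient of the toy loss L(x) = residual . x + 0.5 * curvature * ||x||^2."""
    return [r + curvature * xi for xi, r in zip(x, residual)]


def toy_sensitivity(curvature, dim=8, sigma=0.1, n_samples=64, seed=0):
    """Average gradient-norm change and directional divergence under Gaussian noise.

    `x` stands in for an input embedding and `curvature` for loss sharpness;
    both are hypothetical stand-ins, not quantities taken from a real model.
    """
    rng = random.Random(seed)
    x = [0.0] * dim
    residual = [0.1] * dim  # small nonzero gradient at the unperturbed point
    clean = toy_gradient(x, curvature, residual)
    clean_norm = l2_norm(clean)

    spike_total = 0.0
    divergence_total = 0.0
    for _ in range(n_samples):
        # Perturb the "embedding" with isotropic Gaussian noise.
        noisy_x = [xi + rng.gauss(0.0, sigma) for xi in x]
        noisy = toy_gradient(noisy_x, curvature, residual)
        # Signal 1: how much the gradient magnitude moves.
        spike_total += abs(l2_norm(noisy) - clean_norm)
        # Signal 2: how far the gradient direction rotates (1 - cosine).
        divergence_total += 1.0 - cosine(clean, noisy)
    return spike_total / n_samples, divergence_total / n_samples


# Same noise draws (same seed) applied to a flat and a sharp toy landscape.
flat_spike, flat_div = toy_sensitivity(curvature=0.1)
sharp_spike, sharp_div = toy_sensitivity(curvature=10.0)
print("flat :", flat_spike, flat_div)
print("sharp:", sharp_spike, sharp_div)
```

In this toy setting the high-curvature case yields both a larger magnitude change and a larger directional divergence, which is the behavior the flat-versus-sharp hypothesis predicts. With a real model, the gradient would presumably come from a backward pass through the network rather than from a closed-form expression.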
Topics
- Stubborn Hallucinations
- Gradient Sensitivity
- Loss Landscape Geometry
- Embedding Perturbation
- Hallucination Detection
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.