Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
Summary
Researchers introduced FalseCite, a curated dataset of 82,000 false claims designed to benchmark the factual hallucination tendencies of Large Language Models (LLMs), particularly when hallucination is induced by misleading or fabricated citations. Experiments with GPT-4o-mini, Falcon-7B, and Mistral-7B revealed a significant increase in hallucination rates for false claims paired with deceptive citations, with GPT-4o-mini showing the largest relative increase despite having the lowest baseline hallucination rate. The study also analyzed the models' internal hidden state vectors and observed that they trace a distinct "horn-like" shape whether or not the model is hallucinating. FalseCite combines false claims from the FEVER and SciQ datasets and generates deceptive citations through random and semantic pairing strategies to create varied test conditions. The work highlights the critical role of citations in amplifying LLM hallucinations and offers a foundation for future research into mitigation.
Key takeaway
If you are a research scientist developing or deploying LLMs in sensitive domains, prioritize evaluating your models against citation-driven hallucination benchmarks such as FalseCite. False citations amplified hallucinations even in an advanced model like GPT-4o-mini, which indicates a critical vulnerability. Implement robust verification of cited information so that models do not confidently propagate misinformation, especially when the citations appear plausible.
Key insights
Fabricated citations significantly amplify LLM hallucination; GPT-4o-mini, which had the lowest baseline hallucination rate, showed the largest relative increase.
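A low baseline and a large relative increase are compatible, and a small worked example may make that concrete. The sketch below is illustrative only: the model names and rates are hypothetical placeholders, not results from the paper, and `hallucination_rate` and `relative_increase` are helper names introduced here.

```python
def hallucination_rate(labels):
    """Fraction of responses judged hallucinated (labels are booleans)."""
    return sum(labels) / len(labels)


def relative_increase(baseline, with_citation):
    """Relative change in hallucination rate once a deceptive citation is attached."""
    if baseline == 0:
        raise ValueError("relative increase is undefined for a zero baseline")
    return (with_citation - baseline) / baseline


# Hypothetical (baseline, with-citation) rates. NOT figures from the paper.
rates = {
    "model_a": (0.05, 0.15),  # low baseline
    "model_b": (0.40, 0.60),  # high baseline
}

for name, (base, cited) in rates.items():
    absolute = cited - base
    relative = relative_increase(base, cited)
    print(f"{name}: absolute +{absolute:.2f}, relative +{relative:.0%}")
```

In this toy setup the low-baseline model has the smaller absolute jump but the larger relative one, which is the pattern the summary attributes to GPT-4o-mini.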
Principles
- False citations increase LLM hallucination rates.
- Semantic alignment of citations can make false claims more convincing.
- Hidden state vectors trace a distinct "horn-like" shape during generation, whether or not the model hallucinates.
Method
FalseCite is built by pairing false claims from FEVER and SciQ with fabricated citations that are matched either randomly or semantically. An expert model (GPT-4.1) judges whether each response is a hallucination, and internal hidden states are analyzed via Spearman correlation and k-means clustering.
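The two pairing strategies can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the claims and citation strings are made up, and bag-of-words cosine similarity is an assumed stand-in for whatever semantic matching the paper actually uses.

```python
import math
import random
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a real text embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_random(claims, citations, seed=0):
    """Attach a uniformly sampled citation to each claim."""
    rng = random.Random(seed)
    return [(claim, rng.choice(citations)) for claim in claims]


def pair_semantic(claims, citations):
    """Attach the most similar citation to each claim."""
    cite_vecs = [bag_of_words(c) for c in citations]
    pairs = []
    for claim in claims:
        vec = bag_of_words(claim)
        best = max(range(len(citations)), key=lambda i: cosine(vec, cite_vecs[i]))
        pairs.append((claim, citations[best]))
    return pairs


# Hypothetical false claims and fabricated citation titles (illustration only).
claims = [
    "The moon is made of cheese",
    "Water boils at 50 degrees Celsius at sea level",
]
citations = [
    "A study of water boiling points at sea level",
    "Composition of the moon surface",
    "History of medieval trade routes",
]

for claim, citation in pair_semantic(claims, citations):
    print(f"{claim!r} <- {citation!r}")
```

Semantic pairing yields a citation that looks topically relevant to the claim, while random pairing gives a contrasting condition in which the citation is usually unrelated.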
In practice
- Use FalseCite to evaluate LLM robustness against citation-driven hallucinations.
- Monitor LLM responses for fabricated citations.
- Analyze hidden states to understand hallucination mechanisms.
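For the hidden-state analysis, the two named tools (k-means clustering and Spearman correlation) can be tried on toy data before touching a real model. The sketch below uses only the standard library and invented 2-D points in place of actual hidden state vectors. How the paper applies each tool is not specified in this summary, so the usage shown is an assumption.

```python
import math
import random


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy stand-ins for hidden state vectors (two well-separated groups).
toy_states = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
labels, centers = kmeans(toy_states, k=2)
print("cluster labels:", labels)
print("spearman:", spearman([1, 2, 3, 4], [10, 20, 30, 45]))
```

With real models, the points would be hidden state vectors extracted per token or per layer, and a library implementation (for example scikit-learn's `KMeans` or SciPy's `spearmanr`) would likely be a better choice than this hand-rolled version.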
Topics
- LLM Hallucination
- Factual Benchmarking
- FalseCite Dataset
- Internal State Analysis
- Hidden State Clustering
Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.