Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

Researchers introduced FalseCite, a curated dataset of 82,000 false claims for benchmarking the factual hallucination tendencies of Large Language Models (LLMs), particularly hallucinations induced by misleading or fabricated citations. FalseCite draws its false claims from the FEVER and SciQ datasets and attaches deceptive citations through two pairing strategies, random and semantic, to create varied test conditions. In experiments with GPT-4o-mini, Falcon-7B, and Mistral-7B, hallucination rates rose significantly when false claims were paired with deceptive citations; GPT-4o-mini showed the largest relative increase despite having the lowest baseline hallucination rate. The study also analyzed the models' internal hidden-state vectors and observed a distinct "horn-like" shape in them whether or not the model was hallucinating. The work highlights the critical role of citations in amplifying LLM hallucinations and offers a foundation for future research into mitigation.

Key takeaway

Research scientists developing or deploying LLMs in sensitive domains should prioritize evaluating their models against citation-driven hallucination benchmarks such as FalseCite. False citations amplified hallucinations even in an advanced model like GPT-4o-mini, which indicates a critical vulnerability. Put robust verification of cited information in place so that models do not confidently propagate misinformation, especially when the citations appear plausible.

Key insights

Fabricated citations significantly amplify LLM hallucination. The relative increase was largest for GPT-4o-mini, the model with the lowest baseline hallucination rate.
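A low baseline and a large relative increase are not in tension, and a short calculation makes that concrete. The sketch below compares absolute and relative increases for two models. The model names and rates are hypothetical values chosen only to illustrate the arithmetic; they are not results from the study.

```python
def absolute_increase(baseline_rate, cited_rate):
    """Difference in hallucination rate, in rate points."""
    return cited_rate - baseline_rate


def relative_increase(baseline_rate, cited_rate):
    """Increase expressed as a fraction of the baseline rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (cited_rate - baseline_rate) / baseline_rate


# HYPOTHETICAL rates (baseline, with deceptive citation); not from the paper.
rates = {
    "model_a": (0.05, 0.15),  # low baseline
    "model_b": (0.30, 0.45),  # high baseline
}

for name, (baseline, cited) in rates.items():
    abs_inc = absolute_increase(baseline, cited)
    rel_inc = relative_increase(baseline, cited)
    print(f"{name}: absolute +{abs_inc:.2f}, relative +{rel_inc:.0%}")
```

With these made-up numbers, model_b gains more rate points, yet model_a shows the larger relative increase because its baseline is so small. Reporting both figures avoids reading a relative jump as a higher overall hallucination rate.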

Method

FalseCite is built by pairing false claims from FEVER and SciQ with fabricated citations, matched either randomly or semantically. An expert model (GPT-4.1) judges whether each response hallucinates, and the models' internal hidden states are analyzed with Spearman correlation and k-means clustering.
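The two pairing strategies can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names (`pair_random`, `pair_semantic`), the toy claims and citations, and the use of token-overlap (Jaccard) similarity as a stand-in for whatever semantic matching the paper uses are all hypothetical.

```python
import random
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for a sentence embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1] (ASSUMED similarity measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def pair_random(claims, citations, seed=0):
    """Random pairing: attach a uniformly sampled citation to each claim."""
    rng = random.Random(seed)
    return [(claim, rng.choice(citations)) for claim in claims]


def pair_semantic(claims, citations):
    """Semantic pairing: attach the citation most similar to each claim."""
    return [
        (claim, max(citations, key=lambda cit: jaccard(claim, cit)))
        for claim in claims
    ]


# HYPOTHETICAL toy data, not taken from FalseCite, FEVER, or SciQ.
claims = [
    "The Eiffel Tower is located in Berlin.",
    "Mitochondria are the site of photosynthesis in animal cells.",
]
citations = [
    "Smith (2019), A survey of European tower architecture and the Eiffel Tower.",
    "Lee (2021), Mitochondria and energy metabolism in animal cells.",
    "Garcia (2017), Trade routes of the medieval spice market.",
]

random_pairs = pair_random(claims, citations, seed=42)
semantic_pairs = pair_semantic(claims, citations)

for claim, citation in semantic_pairs:
    print(f"{claim} [{citation}]")
```

Semantic pairing yields citations that look topically relevant to the false claim, while random pairing gives a control condition in which the citation is unrelated more often than not. A real pipeline would likely replace `jaccard` with embedding-based similarity.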

Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.