Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

2026-02-13 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

Researchers introduced FalseCite, a curated dataset of 82,000 false claims designed to benchmark how prone Large Language Models (LLMs) are to factual hallucination, particularly when the hallucination is induced by misleading or fabricated citations. Experiments with GPT-4o-mini, Falcon-7B, and Mistral-7B showed a significant increase in hallucination rates when false claims were paired with deceptive citations; GPT-4o-mini had the largest relative increase despite having the lowest baseline hallucination rate. The study also analyzed the models' internal hidden state vectors and observed a distinct "horn-like" shape that appears regardless of whether the output was a hallucination. FalseCite combines false claims from the FEVER and SciQ datasets and generates deceptive citations through random and semantic pairing strategies to create varied test conditions. The work highlights the role of citations in amplifying LLM hallucinations and offers a foundation for future research into mitigation.

Key takeaway

Research scientists developing or deploying LLMs in sensitive domains should prioritize evaluating their models against citation-driven hallucination benchmarks such as FalseCite. False citations amplified hallucinations even in GPT-4o-mini, the model with the lowest baseline hallucination rate, which points to a critical vulnerability. Implement robust verification of cited information so that models do not confidently propagate misinformation, especially when citations appear plausible.

Key insights

Fabricated citations significantly amplify LLM hallucination, and the relative increase was largest for GPT-4o-mini, the model with the lowest baseline hallucination rate.

Principles

False citations increase LLM hallucination rates.
Semantic alignment of citations can make false claims more convincing.
Hidden state vectors form a distinct "horn-like" shape regardless of whether the model hallucinates.
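
The distinction between a low baseline and a large relative increase is easy to blur, so a small worked example may help. The sketch below uses invented judge labels (not results from the paper) and hypothetical helper names to show how two models with the same absolute increase can have very different relative increases.

```python
def hallucination_rate(labels):
    """Fraction of responses judged hallucinated (labels are booleans)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def relative_increase(baseline, with_citation):
    """Change in rate under the cited condition, relative to the baseline rate."""
    if baseline == 0:
        raise ValueError("relative increase is undefined for a zero baseline")
    return (with_citation - baseline) / baseline


# Hypothetical judge labels (True = hallucinated); NOT figures from the paper.
model_a_plain = [True] * 1 + [False] * 9  # low baseline: 1 of 10
model_a_cited = [True] * 3 + [False] * 7  # with deceptive citation: 3 of 10
model_b_plain = [True] * 4 + [False] * 6  # higher baseline: 4 of 10
model_b_cited = [True] * 6 + [False] * 4  # with deceptive citation: 6 of 10

rel_a = relative_increase(hallucination_rate(model_a_plain),
                          hallucination_rate(model_a_cited))
rel_b = relative_increase(hallucination_rate(model_b_plain),
                          hallucination_rate(model_b_cited))

# Both models gain two extra hallucinations per ten prompts, yet the
# low-baseline model has the larger relative increase.
print(f"model A relative increase: {rel_a:.2f}")
print(f"model B relative increase: {rel_b:.2f}")
```

Reporting both the absolute and the relative change avoids reading a large relative increase as a high overall hallucination rate.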

Method

FalseCite is built by pairing false claims from FEVER and SciQ with fabricated citations that are matched either randomly or semantically. An expert model (GPT-4.1) judges whether each response is a hallucination, and internal hidden states are analyzed with Spearman correlation and k-means clustering.
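
To make the two pairing strategies concrete, here is a minimal sketch. It is not the authors' pipeline: the function names, the toy claims and citation titles, and the bag-of-words cosine similarity (a crude stand-in for whatever semantic matching the paper uses) are all assumptions for illustration.

```python
import math
import random
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector; a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def random_pairing(claims, citations, seed=0):
    """Attach a uniformly random citation to each claim."""
    rng = random.Random(seed)
    return [(claim, rng.choice(citations)) for claim in claims]


def semantic_pairing(claims, citations):
    """Attach to each claim the citation whose text is most similar to it."""
    cite_vecs = [(cit, bow(cit)) for cit in citations]
    pairs = []
    for claim in claims:
        vec = bow(claim)
        best = max(cite_vecs, key=lambda cv: cosine(vec, cv[1]))[0]
        pairs.append((claim, best))
    return pairs


# Toy, invented inputs; the real dataset draws its claims from FEVER and SciQ.
claims = ["the moon is made of cheese", "sound travels faster than light"]
citations = [
    "lunar cheese composition survey",
    "sound versus light speed measurements",
    "deep sea coral growth",
]

for claim, citation in semantic_pairing(claims, citations):
    print(f"{claim!r} <- {citation!r}")
```

Random pairing produces citations that are usually off-topic, while semantic pairing produces ones that look topically relevant, which gives the two test conditions described above.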

In practice

Use FalseCite to evaluate LLM robustness against citation-driven hallucinations.
Monitor LLM responses for fabricated citations.
Analyze hidden states to understand hallucination mechanisms.
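
As a starting point for the hidden-state analysis, the sketch below implements Spearman correlation and k-means in plain Python. Several things here are assumptions rather than details from the paper: the summary does not say which quantities are correlated, so correlating each vector's L2 norm with a binary hallucination label is purely illustrative; the 2-D vectors are invented and made separable only to exercise the code; and the farthest-point initialization is a convenience to keep the run deterministic.

```python
import math


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm; returns one cluster index per point."""
    # Farthest-point initialization (an assumption) keeps the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


# Invented 2-D "hidden states"; real hidden states have thousands of dimensions.
states = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
hallucinated = [0, 0, 0, 1, 1, 1]  # hypothetical judge labels

norms = [math.hypot(*s) for s in states]
rho = spearman(norms, hallucinated)
clusters = kmeans(states, k=2)
print("spearman rho:", round(rho, 3))
print("cluster assignment:", clusters)
```

On real data the clusters need not line up with hallucination labels; the study reports the same horn-like shape whether or not the model hallucinates. For production use, library implementations such as `scipy.stats.spearmanr` and `sklearn.cluster.KMeans` would be the more usual choice.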

Topics

LLM Hallucination
Factual Benchmarking
FalseCite Dataset
Internal State Analysis
Hidden State Clustering

Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.