Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

2026-04-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Research on the Qwen2.5-1.5B language model provides causal evidence that hallucination is an "early trajectory commitment" governed by asymmetric attractor dynamics. Using a "same-prompt bifurcation" method, the authors found that 27 of 61 prompts (44.3%) spontaneously diverged into factual and hallucinated trajectories at the first generated token, accompanied by a sharp increase in KL divergence. Activation patching revealed a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupted the output in 87.5% of trials (layer 20), while the reverse (correcting a hallucinated trajectory) succeeded in only 33.3% of trials (layer 24). This roughly 2.6x gap indicates that hallucination acts as a locally stable attractor basin that is easy to enter but difficult to escape, so correction requires sustained multi-step intervention. Furthermore, step-0 residual states predict per-prompt hallucination rates (Pearson r=0.776 at layer 15), and unsupervised clustering identifies five regime-like groups, with a "saddle cluster" concentrating 12 of 13 bifurcating false-premise prompts.
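
The headline figures above can be checked with a few lines of arithmetic. The sketch below also shows one way a KL divergence between two trajectories' next-token distributions might be computed. The three-token distributions are hypothetical toy values, not data from the paper, and the name `kl_divergence` is an illustrative choice.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Figures reported in the summary.
bifurcating_prompts, total_prompts = 27, 61
bifurcation_rate = bifurcating_prompts / total_prompts

corruption_rate = 0.875  # hallucinated activation patched into a correct run (layer 20)
correction_rate = 0.333  # correct activation patched into a hallucinated run (layer 24)
asymmetry_ratio = corruption_rate / correction_rate

# Hypothetical toy next-token distributions over a 3-token vocabulary.
factual = [0.70, 0.20, 0.10]
halluc_before = [0.60, 0.25, 0.15]  # trajectories still close together
halluc_after = [0.10, 0.20, 0.70]   # trajectories after they diverge

kl_before = kl_divergence(factual, halluc_before)
kl_after = kl_divergence(factual, halluc_after)

print(f"bifurcation rate: {bifurcation_rate:.1%}")
print(f"asymmetry ratio:  {asymmetry_ratio:.1f}x")
print(f"KL before vs after divergence: {kl_before:.3f} vs {kl_after:.3f} nats")
```

With real model outputs, `factual` and the hallucinated distributions would be softmaxed logits taken at the same generation step from the two sampled continuations of one prompt.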

Key takeaway

For AI Engineers and Research Scientists developing or deploying LLMs, this research indicates that hallucination is a trajectory-level dynamic set early in generation, not merely a knowledge retrieval failure. You should prioritize early-stage intervention strategies, potentially at the prompt encoding level, to steer models away from hallucination-prone regimes. Be aware that single-point corrections succeeded in only 33.3% of the reported trials; robust solutions will likely require coordinated, multi-step interventions to escape established hallucination attractors.

Key insights

Hallucination in LLMs is an early, asymmetric attractor phenomenon, easy to trigger but hard to correct.

Principles

Hallucination commitment can occur as early as the first generated token.
Corruption is significantly easier than correction in LLM trajectories.
Prompt encoding pre-commits models to hallucination risk regimes.

Method

Same-prompt bifurcation isolates trajectory dynamics. Symmetric causal patching measures directional asymmetry. Unsupervised clustering of step-0 residual states identifies hallucination risk regimes.
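
The symmetric patching step can be illustrated with a deliberately tiny stand-in for a model. This is a minimal sketch under stated assumptions, not the authors' code: the "model" is a hypothetical stack of three elementwise layers, and `run` and `patch_both_directions` are illustrative names. In a real transformer the overwrite would typically be applied with a forward hook on one layer (for example PyTorch's `register_forward_hook`) and would touch only the patched position, so the final output would not be fully determined by the patch as it is here.

```python
def run(layers, x, patch=None):
    """Run x through the layers and cache every layer's output.

    patch is an optional (layer_index, activation) pair; the output of that
    layer is overwritten with the given activation before continuing.
    """
    cache = []
    h = list(x)
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch is not None and patch[0] == i:
            h = list(patch[1])  # overwrite with the donor activation
        cache.append(list(h))
    return h, cache


def patch_both_directions(layers, x_a, x_b, layer_index):
    """Patch run A's activation into run B and vice versa at one layer."""
    out_a, cache_a = run(layers, x_a)
    out_b, cache_b = run(layers, x_b)
    a_into_b, _ = run(layers, x_b, patch=(layer_index, cache_a[layer_index]))
    b_into_a, _ = run(layers, x_a, patch=(layer_index, cache_b[layer_index]))
    return {"out_a": out_a, "out_b": out_b, "a_into_b": a_into_b, "b_into_a": b_into_a}


# Hypothetical toy layers; a real model would have attention and MLP blocks.
toy_layers = [
    lambda h: [v + 1.0 for v in h],
    lambda h: [2.0 * v for v in h],
    lambda h: [v - 3.0 for v in h],
]

result = patch_both_directions(toy_layers, [1.0, 0.0], [0.0, 2.0], layer_index=1)
print(result)
```

Measuring asymmetry then amounts to repeating both directions over many prompt pairs and comparing how often each direction flips the output label, which is where the reported 87.5% and 33.3% rates would come from.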

In practice

Focus interventions on early generation steps.
Sustained, multi-step interventions are needed for correction.
Analyze prompt encoding to predict hallucination risk.
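
To make the last point concrete, the sketch below computes a Pearson correlation between a scalar score derived from each prompt's step-0 residual state and that prompt's observed hallucination rate. How such a score is obtained (for instance, a linear probe on one layer) is an assumption here, and the four data points are invented for illustration; they are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-prompt values: a probe score computed from the step-0
# residual state, and the fraction of sampled continuations that hallucinated.
probe_scores = [0.1, 0.4, 0.5, 0.9]
hallucination_rates = [0.0, 0.3, 0.6, 0.8]

r = pearson_r(probe_scores, hallucination_rates)
print(f"Pearson r = {r:.3f}")
```

A high correlation on held-out prompts would suggest that risk can be flagged before any token is generated, which is the practical reading of the reported r=0.776 at layer 15.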

Topics

LLM Hallucination
Attractor Dynamics
Transformer Generation
Activation Patching
Same-Prompt Bifurcation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.