Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
Summary
Experiments on Qwen2.5-1.5B provide causal evidence that hallucination in autoregressive language models stems from early trajectory commitment, driven by asymmetric attractor dynamics. Using same-prompt bifurcation across 61 prompts, the study found that for 27 of them (44.3%) the factual and hallucinated trajectories diverged at the first generated token, with KL divergence exceeding 1.0 by step 1. Activation patching across 28 layers revealed a significant causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupted the output in 87.5% of trials (layer 20), while the reverse patch recovered the correct output in only 33.3% of trials (layer 24). Correction required sustained multi-step intervention, whereas corruption needed only a single perturbation. Step-0 residual states, taken at prompt encoding, predicted the per-prompt hallucination rate with Pearson r = 0.776 at layer 15, indicating that the basin structure is organized by regimes fixed at prompt encoding.
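The divergence measurement can be illustrated with a minimal sketch. It assumes that divergence is measured as the KL divergence between the next-token distributions of two trajectories at each generation step; the summary does not specify the exact computation, so the function names, the 0-based step index, and the toy logits below are all hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def first_divergence_step(logits_a, logits_b, threshold=1.0):
    """Return the first step (0-based) where KL exceeds threshold, else None.

    logits_a and logits_b hold one logit vector per generation step for the
    two trajectories being compared (an assumed data layout).
    """
    for step, (la, lb) in enumerate(zip(logits_a, logits_b)):
        if kl_divergence(softmax(la), softmax(lb)) > threshold:
            return step
    return None

# Toy logits over a 3-token vocabulary, invented for illustration.
# Step 0 is identical because both trajectories share the same prompt.
factual = [[2.0, 1.0, 0.0], [5.0, 0.0, 0.0]]
hallucinated = [[2.0, 1.0, 0.0], [0.0, 5.0, 0.0]]
print(first_divergence_step(factual, hallucinated))
```

With a shared prompt, the two trajectories have the same distribution before the first token is sampled, so any divergence can only show up once the sampled tokens differ.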
Key takeaway
For AI engineers developing or fine-tuning autoregressive language models, the central point is that hallucination is an early, stable trajectory commitment. Mitigation efforts should focus on the prompt encoding stage or the first steps of generation, because correcting a hallucinated trajectory later requires far more coordinated and sustained intervention across multiple layers and steps. Consider analyzing step-0 residual states to predict and prevent per-prompt hallucination.
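One way to evaluate such a step-0 analysis is to correlate a per-prompt probe score with the observed hallucination rate. The sketch below assumes a probe has already produced one score per prompt and that sampled completions have been labeled as hallucinated or not; the probe itself, the helper names, and every number are hypothetical and are not taken from the study.

```python
import math

def hallucination_rate(labels):
    """Fraction of sampled completions labeled as hallucinated (True)."""
    if not labels:
        raise ValueError("need at least one labeled completion")
    return sum(1 for is_hallucinated in labels if is_hallucinated) / len(labels)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Invented example: one probe score per prompt, plus labeled samples per prompt.
probe_scores = [0.10, 0.35, 0.40, 0.80, 0.90]
sampled_labels = [
    [False, False, False, False],
    [False, True, False, False],
    [False, True, True, False],
    [True, True, False, True],
    [True, True, True, True],
]
observed_rates = [hallucination_rate(labels) for labels in sampled_labels]
print(round(pearson_r(probe_scores, observed_rates), 3))
```

A high correlation on held-out prompts would suggest the probe score is usable as an early risk signal before any token is generated.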
Key insights
Hallucination in LLMs is an early, stable trajectory commitment that is difficult to correct once initiated.
Principles
- Hallucination is an early commitment.
- Correction requires sustained intervention.
Method
Same-prompt bifurcation isolates trajectory dynamics. Activation patching and window patching reveal causal asymmetry. Step-0 residual state probing predicts hallucination rates.
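The mechanics of activation patching can be shown on a toy residual stack. This is a minimal sketch under stated assumptions, not the study's implementation: the "model" is a pure-Python stand-in with one scalar per token position, and `causal_mix_layer`, `run`, and the patch dictionary are hypothetical names. The idea is to cache the state from a donor run, then overwrite one position at one layer in a recipient run and observe the effect downstream.

```python
def causal_mix_layer(h):
    """Toy layer: each position reads the mean of itself and earlier positions."""
    return [0.5 * sum(h[: i + 1]) / (i + 1) for i in range(len(h))]

def run(layers, x, patch=None, cache=None):
    """Run a residual stack over a list of per-position scalars.

    patch: optional dict with "layer", "position", "value"; after that layer,
           the state at that position is overwritten with the given value.
    cache: optional dict filled with a copy of the state after each layer.
    """
    h = list(x)
    for i, layer in enumerate(layers):
        delta = layer(h)
        h = [a + d for a, d in zip(h, delta)]  # residual update
        if patch is not None and patch["layer"] == i:
            h[patch["position"]] = patch["value"]
        if cache is not None:
            cache[i] = list(h)
    return h

layers = [causal_mix_layer] * 4

# Same two prompt positions; the last position stands in for the first
# generated token, which differs between the two runs (invented values).
correct_input = [1.0, 2.0, 3.0]
hallucinated_input = [1.0, 2.0, -3.0]

donor_cache = {}
donor_out = run(layers, hallucinated_input, cache=donor_cache)
clean_out = run(layers, correct_input)

# Inject the donor's layer-1 state at position 2 into the correct run.
patched_out = run(
    layers,
    correct_input,
    patch={"layer": 1, "position": 2, "value": donor_cache[1][2]},
)
print(clean_out[2], patched_out[2], donor_out[2])
```

In this toy the patched position fully follows the donor because the prompt positions are identical and mixing is causal. In a real transformer the outcome depends on the layer and on later generation steps, which is where an asymmetry between corruption and recovery can appear. With a PyTorch model, the same cache-and-overwrite pattern is commonly implemented with forward hooks on the chosen layer.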
In practice
- Intervene early to prevent hallucination.
- Sustained intervention is needed for correction.
Topics
- Hallucinations
- Autoregressive Language Models
- Attractor Dynamics
- Activation Patching
- Same-Prompt Bifurcation
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.