Why Fine-Tuning Encourages Hallucinations and How to Fix It
Summary
Large language models (LLMs) often hallucinate, producing factually incorrect statements, and supervised fine-tuning (SFT) makes the problem worse when the model acquires new factual information during training. This work reinterprets SFT-induced hallucinations as "factual forgetting," a form of the catastrophic forgetting studied in continual learning. The authors propose two mitigation strategies. The first reduces factual plasticity by freezing parameter groups and is effective when acquiring new facts is not a goal. The second, self-distillation, allows new facts to be learned while limiting forgetting, reducing factual forgetting from approximately 15% to approximately 3%. The study also investigates the underlying mechanism and finds that interference among overlapping semantic representations is a primary driver, rather than capacity limitations or behavior cloning. Experiments with Qwen 2.5 (1.5B, 8B) and LLaMA 3.1 (8B) models on the EntityQuestions dataset support these findings and indicate that self-distillation mitigates this interference by regularizing output-distribution drift.
Key takeaway
For AI engineers and research scientists who develop or fine-tune LLMs, the central point is that SFT-induced hallucinations are a form of factual forgetting caused by representational interference. If the goal is task adaptation without new factual knowledge, selectively freezing FFN (feed-forward network) parameters can preserve existing knowledge. When new facts must be acquired, self-distillation reduces forgetting from ~15% to ~3% by stabilizing output distributions, so the model stays plastic enough to learn while remaining more stable on what it already knows.
Key insights
SFT-induced hallucinations stem from factual forgetting caused by semantic interference, which self-distillation or parameter freezing can mitigate.
Principles
- Factual plasticity trades off with factual stability.
- Semantic overlap drives representational interference.
- Output-distribution drift causes factual forgetting.
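One way to make the stability side of these principles measurable is a forgetting rate: among facts the model answered correctly before fine-tuning, the fraction it answers incorrectly afterwards. The sketch below is an illustration only. The function name, the dictionary layout, and this particular definition are assumptions, and the study's own metric may be defined differently.

```python
def factual_forgetting_rate(correct_before, correct_after):
    """Fraction of previously known facts that are wrong after fine-tuning.

    Both arguments map a question id to True/False (answered correctly or not).
    This definition is an assumption made for illustration.
    """
    # Facts the base model already answered correctly.
    known = [q for q, ok in correct_before.items() if ok]
    if not known:
        return 0.0
    # A known fact counts as forgotten if it is now wrong or missing.
    forgotten = sum(1 for q in known if not correct_after.get(q, False))
    return forgotten / len(known)


if __name__ == "__main__":
    before = {"q1": True, "q2": True, "q3": False, "q4": True, "q5": True}
    after = {"q1": True, "q2": False, "q3": True, "q4": True, "q5": True}
    print(factual_forgetting_rate(before, after))
```

Questions the model got wrong before fine-tuning are excluded from the denominator, so newly learned facts do not mask forgetting of old ones.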
Method
Self-distillation regularizes fine-tuning by constraining shifts in the output distribution: the outputs of a frozen teacher model guide the student, which limits parameter updates that would degrade existing knowledge.
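A common way to express this kind of regularization is to add a KL-divergence penalty between the frozen teacher's output distribution and the student's to the usual cross-entropy loss. The pure-Python sketch below shows that idea for a single token position. The KL direction (teacher to student), the `kl_weight` parameter, and the function names are assumptions for illustration, not the study's confirmed objective.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def self_distillation_loss(student_logits, teacher_logits, target_index, kl_weight=1.0):
    """Cross-entropy on the new target plus a KL penalty toward the teacher.

    teacher_logits come from a frozen copy of the model and are treated as
    constants. kl_weight is a hypothetical hyperparameter.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    # Standard SFT term: negative log-likelihood of the target token.
    ce = -s[target_index]
    # KL(teacher || student): penalizes drift away from the teacher's distribution.
    kl = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))
    return ce + kl_weight * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [0.5, 2.0, -1.0]
    print(self_distillation_loss(student, teacher, target_index=1))
```

When the student's distribution matches the teacher's, the KL term is zero and the loss reduces to plain cross-entropy. A larger `kl_weight` favors stability over plasticity.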
In practice
- Freeze FFN parameters to suppress new fact acquisition.
- Apply self-distillation for new factual learning.
- Use UUID-style identifiers to avoid semantic overlap.
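The first item above, freezing FFN parameters, can be sketched as follows. The helper works on any iterable of `(name, parameter)` pairs whose parameters expose a `requires_grad` attribute, which is what PyTorch's `model.named_parameters()` yields. The name pattern (`mlp`, `ffn`, `feed_forward`) is an assumption about how feed-forward weights are named and should be checked against the actual model, and `freeze_ffn` itself is a hypothetical helper rather than a library function.

```python
import re

# Assumed naming for feed-forward blocks; verify against the model's parameter names.
FFN_PATTERN = re.compile(r"(^|\.)(mlp|ffn|feed_forward)(\.|$)")


def freeze_ffn(named_parameters, pattern=FFN_PATTERN):
    """Disable gradients for parameters whose names match the FFN pattern.

    Returns the list of frozen parameter names so the selection can be inspected.
    """
    frozen = []
    for name, param in named_parameters:
        if pattern.search(name):
            param.requires_grad = False  # excluded from gradient updates
            frozen.append(name)
    return frozen


if __name__ == "__main__":
    class DummyParam:
        """Stand-in for a framework parameter object."""
        def __init__(self):
            self.requires_grad = True

    params = [
        ("model.layers.0.mlp.up_proj.weight", DummyParam()),
        ("model.layers.0.self_attn.q_proj.weight", DummyParam()),
    ]
    print(freeze_ffn(params))
```

With a PyTorch model, the call would be `freeze_ffn(model.named_parameters())` before building the optimizer. Printing the returned names is a quick check that only the intended weights were frozen.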
Topics
- Supervised Fine-Tuning
- LLM Hallucinations
- Factual Forgetting
- Continual Learning
- Self-Distillation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.