Why Fine-Tuning Encourages Hallucinations and How to Fix It

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Large language models (LLMs) often produce factually incorrect statements (hallucinations), and supervised fine-tuning (SFT) makes the problem worse when it teaches the model new facts. This work reinterprets SFT-induced hallucinations as "factual forgetting," a form of the catastrophic forgetting studied in continual learning. The authors propose two mitigation strategies. The first reduces factual plasticity by freezing parameter groups, which works when acquiring new facts is not a goal. The second is self-distillation, which lets the model learn new facts while limiting forgetting; it is reported to reduce factual forgetting from approximately 15% to approximately 3%. The study also investigates the underlying mechanism and finds that interference among overlapping semantic representations is the primary driver, rather than capacity limitations or behavior cloning. Experiments with Qwen 2.5 (1.5B, 8B) and LLaMA 3.1 (8B) models on the EntityQuestions dataset support these findings and show that self-distillation mitigates the interference by regularizing output-distribution drift.

Key takeaway

For AI engineers and research scientists who develop or fine-tune LLMs, the key point is that SFT-induced hallucinations are a form of factual forgetting caused by representational interference. If the goal is task adaptation without new factual knowledge, selectively freezing feed-forward network (FFN) parameters can preserve existing knowledge. When the model must acquire new facts, self-distillation stabilizes the output distribution and is reported to reduce forgetting from ~15% to ~3%, keeping the model plastic enough to learn while improving stability.

Key insights

SFT-induced hallucinations stem from factual forgetting caused by semantic interference; self-distillation or parameter freezing can mitigate it.

Principles

Factual plasticity trades off with factual stability.
Semantic overlap drives representational interference.
Output-distribution drift causes factual forgetting.

Method

Self-distillation regularizes fine-tuning by constraining output-distribution shifts, using a frozen teacher model's output to guide the student, thereby limiting parameter updates that degrade existing knowledge.
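
As a rough illustration of the method described above, the sketch below combines a standard cross-entropy term on the new target token with a KL penalty that pulls the student's output distribution toward that of a frozen teacher. The function names, the weight `beta`, and the KL direction (teacher to student) are assumptions made for illustration; this summary does not specify the exact loss used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def self_distillation_loss(student_logits, teacher_logits, target_index, beta=1.0):
    """Cross-entropy on the new target plus a drift penalty toward the teacher.

    teacher_logits come from a frozen copy of the model taken before
    fine-tuning (assumption: same vocabulary and same input as the student).
    beta is a hypothetical weight on the drift penalty.
    Returns (total, cross_entropy, drift) for a single token position.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    ce = -math.log(student_probs[target_index])
    drift = kl_divergence(teacher_probs, student_probs)
    return ce + beta * drift, ce, drift


# Example: a student that has moved away from the teacher pays a drift penalty.
teacher = [2.0, 0.5, -1.0]
student = [0.0, 3.0, -1.0]
total, ce, drift = self_distillation_loss(student, teacher, target_index=1, beta=0.5)
```

When the student's logits equal the teacher's, the drift term is zero and the loss reduces to plain cross-entropy; a larger `beta` trades plasticity for stability.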

In practice

Freeze FFN parameters to suppress new fact acquisition.
Apply self-distillation for new factual learning.
Use UUID-style identifiers to avoid semantic overlap.
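
The first and third items above can be sketched as follows. This is a minimal illustration under stated assumptions: the `.mlp.` name pattern matches the feed-forward parameter names used in the Hugging Face implementations of LLaMA and Qwen 2.5 (for example `model.layers.0.mlp.up_proj.weight`), and replacing entity names with deterministic UUID-style strings is only one reading of the third item. Neither helper is taken from the paper.

```python
import uuid


def ffn_parameter_names(param_names, pattern=".mlp."):
    """Return the parameter names that belong to feed-forward blocks.

    With PyTorch, one would then set requires_grad = False on each matching
    parameter from model.named_parameters() before building the optimizer.
    The default pattern is an assumption about the model's naming scheme.
    """
    return [name for name in param_names if pattern in name]


def uuid_style_identifier(entity_name, namespace=uuid.NAMESPACE_DNS):
    """Map an entity name to a deterministic UUID-style string.

    uuid5 is deterministic, so the same entity always receives the same
    identifier, while the identifier is intended to share little surface
    or semantic similarity with other entity names.
    """
    return str(uuid.uuid5(namespace, entity_name))


# Hypothetical parameter names in the Hugging Face LLaMA/Qwen style.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.embed_tokens.weight",
]
frozen = ffn_parameter_names(names)  # the three .mlp. entries

# Hypothetical entity name; any string works.
entity_id = uuid_style_identifier("Example Entity")
```

Selecting by name keeps the choice of frozen groups explicit and easy to audit; the pattern should be checked against the actual parameter names of the model being fine-tuned.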

Topics

Supervised Fine-Tuning
LLM Hallucinations
Factual Forgetting
Continual Learning
Self-Distillation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.