MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Large Language Models · Depth: Expert, extended

Summary

MixSD is a method for injecting new knowledge into large language models (LLMs) through supervised fine-tuning (SFT) that needs no external teacher and mitigates catastrophic forgetting. Standard SFT often degrades pretrained capabilities such as reasoning and instruction following because its targets, written by humans or external systems, diverge from the model's native autoregressive distribution. MixSD instead constructs supervision targets dynamically by mixing tokens from an "expert conditional" (the model observing the fact in context) and a "naive conditional" (the model's original prior). This preserves the factual learning signal while keeping supervision sequences closer to the base model's distribution. Evaluated on factual recall, arithmetic function acquisition, and knowledge editing across Qwen3 (1.7B, 4B, 8B) and Llama-3.2-1B-Instruct models, MixSD consistently achieves a better memorization-retention trade-off: it retains up to 100% of base-model capabilities, compared with 1% retention for SFT, while maintaining near-perfect training accuracy.

Key takeaway

For AI engineers and research scientists fine-tuning LLMs on domain-specific knowledge, MixSD is an alternative to standard SFT that keeps supervision targets close to the model's native distribution. In the reported experiments, it injects new facts or edits existing knowledge with near-perfect training accuracy while preserving up to 100% of the base model's reasoning and general-domain capabilities, compared with 1% retention for SFT. Consider MixSD when degradation of pretrained capabilities would undermine the model's usefulness in a specialized application.

Key insights

Aligning supervision targets with an LLM's native generation distribution significantly mitigates catastrophic forgetting during knowledge injection.

Principles

Catastrophic forgetting stems from distribution mismatch in SFT targets.
Update direction, not magnitude, strongly predicts capability degradation.
Supervision with lower per-token negative log-likelihood (NLL) reduces forgetting.
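
To make the per-token NLL principle concrete, here is a minimal sketch of how the quantity is computed. It assumes the NLL is measured under the base model, which the summary implies but does not state. The function name `per_token_nll` and the probability values are illustrative assumptions, not taken from the paper; in practice the probabilities would come from scoring each target token with the base model.

```python
import math
from typing import Sequence


def per_token_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of a target sequence.

    token_probs[i] is the probability the base model assigns to the i-th
    target token given the tokens before it.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Illustrative numbers only (hypothetical, not results from the paper).
external_target = [0.05, 0.10, 0.02, 0.20]        # tokens the base model finds unlikely
self_generated_target = [0.60, 0.80, 0.50, 0.70]  # tokens close to the model's own output

nll_external = per_token_nll(external_target)
nll_self = per_token_nll(self_generated_target)
print(f"external: {nll_external:.3f}  self-generated: {nll_self:.3f}")
```

A target the base model already considers likely yields a lower per-token NLL, which is the sense in which such supervision sits closer to the model's native distribution.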

Method

MixSD dynamically constructs supervision by mixing tokens from an expert-conditioned rollout and a naive-conditioned rollout of the base model itself, using a mixing rate $\lambda$ for distribution alignment.
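
The token mixing described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: it assumes $\lambda$ is the per-token probability of drawing the next token from the naive conditional (the summary does not say which side $\lambda$ weights), and it replaces both model conditionals with scripted stand-ins. The names `mix_rollout`, `scripted`, and the placeholder tokens are hypothetical.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface: maps the tokens generated so far to the next token.
NextToken = Callable[[Sequence[str]], str]
EOS = "<eos>"


def scripted(sequence: Sequence[str]) -> NextToken:
    """Toy stand-in for a model conditional: replays a fixed sequence by position."""
    def next_token(prefix: Sequence[str]) -> str:
        i = len(prefix)
        return sequence[i] if i < len(sequence) else EOS
    return next_token


def mix_rollout(expert_next: NextToken, naive_next: NextToken, lam: float,
                max_len: int, rng: random.Random) -> List[str]:
    """Build one supervision sequence by choosing a source for each token.

    Assumption: with probability `lam` the token comes from the naive
    conditional, otherwise from the expert conditional. Both sources
    continue from the same shared prefix.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    tokens: List[str] = []
    for _ in range(max_len):
        source = naive_next if rng.random() < lam else expert_next
        token = source(tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens


# Placeholder tokens make the source of each position visible.
expert_seq = ["e0", "e1", "e2", "e3", "e4", "e5"]  # model with the fact in context
naive_seq = ["n0", "n1", "n2", "n3", "n4", "n5"]   # model without the fact

mixed = mix_rollout(scripted(expert_seq), scripted(naive_seq),
                    lam=0.3, max_len=16, rng=random.Random(0))
print(mixed)
```

Under this assumption, setting `lam=0.0` reproduces the expert rollout and `lam=1.0` reproduces the naive rollout. In a real setup both callables would wrap the same base model, differing only in whether the fact is present in the prompt.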

In practice

Use MixSD for knowledge injection to preserve LLM capabilities.
Consider $\lambda=0.3$ as a robust default mixing rate.
Apply MixSD for both novel fact injection and knowledge revision.

Topics

MixSD
Knowledge Injection
Catastrophic Forgetting
Supervised Fine-tuning
Self-Distillation

Code references

NVIDIA-NeMo/RL

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.