No more Catastrophic Forgetting in SFT
Summary
Entropy Kullback-Leibler Divergence Based Token Masking (EKSFT) is a supervised fine-tuning (SFT) method designed to mitigate catastrophic forgetting and mode collapse in large language models. Developed by the University of Science and Knowledge of China, EKSFT flags "viral" tokens: those with high Shannon entropy or high Kullback-Leibler (KL) divergence from a reference model. These high-risk tokens, which disproportionately cause parameter drift, are selectively masked out of the standard cross-entropy loss. The mask is the union of the high-entropy and high-KL token sets, and KL divergence and entropy regularization terms are added to the loss. On a Qwen3 4-billion-parameter model, EKSFT improves on standard SFT by an average of 7% in Pass@1 and 5.1% in Pass@32. When combined with DPO reinforcement learning, it outperforms standard SFT by 5.6% on Pass@32, though some individual benchmark improvements are below 1%.
Key takeaway
For machine learning engineers fine-tuning LLMs, EKSFT offers a way to mitigate catastrophic forgetting by selectively masking high-risk tokens. Weigh the modest reported gains (roughly 5-7% on Pass@K benchmarks) against the extra compute needed for per-token entropy and KL divergence calculations, and decide whether that trade-off pays off for your specific application.
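To put the Pass@K figures in context, here is a minimal sketch of the widely used unbiased pass@k estimator. The source does not say how Pass@1 and Pass@32 were computed, so treating this estimator as the one behind the reported numbers is an assumption; the function name and the sample counts are illustrative only.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled answers of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct answer.
    Requires k <= n.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every subset of size k
        # must include at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 64 samples for one problem, 8 of them correct.
p1 = pass_at_k(64, 8, 1)    # chance that a single sample is correct
p32 = pass_at_k(64, 8, 32)  # chance that at least one of 32 samples is correct
print(p1, p32)
```

Pass@32 rewards a model that keeps diverse answers available, which is why it is a natural metric for a method that targets mode collapse.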
Key insights
EKSFT selectively masks tokens with high entropy or high KL divergence during SFT to mitigate catastrophic forgetting and preserve generalization.
Principles
- High-entropy tokens drive parameter drift.
- KL divergence quantifies model distribution shift.
- Selective masking preserves core LLM knowledge.
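The two quantities behind these principles can be sketched in a few lines of plain Python. This is a toy illustration for a single token position with a four-word vocabulary, not the authors' implementation; the helper names and the KL direction (fine-tuned model relative to the reference) are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs):
    """H(p) = -sum_i p_i * log(p_i), in nats; zero-probability terms add 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); eps guards against q_i == 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

# Toy distributions for one token position (hypothetical logits).
policy = softmax([2.0, 1.0, 0.5, 0.1])     # model being fine-tuned
reference = softmax([2.0, 1.0, 0.5, 0.1])  # frozen reference, identical here
uniform = [0.25] * 4                       # maximally uncertain prediction

print(shannon_entropy(policy))            # lower than the uniform case
print(shannon_entropy(uniform))           # equals log(4), the maximum for 4 outcomes
print(kl_divergence(policy, reference))   # zero: the distributions match
print(kl_divergence(policy, uniform))     # positive: the distributions differ
```

In a real training loop these values would be computed for every token position from the logits of the fine-tuned and reference models, typically with a tensor library rather than Python lists.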
Method
EKSFT computes token-level entropy and KL divergence against a reference model, then selects the high-risk tokens with a top-K ratio and masks them out of the cross-entropy term. KL divergence and entropy regularization are applied to the masked tokens inside the modified loss.
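The selection step can be sketched as follows, assuming per-token entropy and KL scores are already available. The function names, the rounding rule, and the ratio value are assumptions made for illustration; the source only states that a top-K ratio is used and that the two token sets are combined with a union.

```python
def top_ratio_indices(scores, ratio):
    """Indices of the highest-scoring round(ratio * n) positions."""
    k = round(ratio * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def high_risk_mask(entropies, kls, ratio=0.2):
    """True where a token is in the top `ratio` by entropy OR by KL divergence."""
    risky = top_ratio_indices(entropies, ratio) | top_ratio_indices(kls, ratio)
    return [i in risky for i in range(len(entropies))]

# Hypothetical per-token scores for a five-token sequence.
entropies = [0.1, 2.3, 0.4, 0.2, 0.3]
kls = [0.0, 0.1, 0.05, 1.2, 0.02]

mask = high_risk_mask(entropies, kls, ratio=0.2)
print(mask)  # [False, True, False, True, False]
```

Because the mask is a union, a token is flagged when it ranks high on either signal, so the flagged fraction can be up to twice the ratio.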
In practice
- Compute entropy for every token position.
- Add KL divergence regularization against a reference model.
- Combine token masking with the standard SFT loss.
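One way these steps could fit together is sketched below, assuming the per-token cross-entropy, KL divergence, entropy, and high-risk mask have already been computed. The weights `beta` and `gamma`, the sign of the entropy term (written here as an entropy bonus), and the averaging over all tokens are assumptions; the source does not give the exact form of the loss.

```python
def eksft_style_loss(ce, kl, entropy, mask, beta=0.1, gamma=0.01):
    """Average per-token loss with high-risk tokens handled separately.

    Unmasked tokens contribute ordinary cross-entropy. Masked tokens drop
    the cross-entropy term and contribute only the regularizers:
    beta * KL (stay close to the reference) minus gamma * entropy
    (assumed here to discourage collapse of the token distribution).
    """
    total = 0.0
    for ce_t, kl_t, h_t, masked in zip(ce, kl, entropy, mask):
        if masked:
            total += beta * kl_t - gamma * h_t
        else:
            total += ce_t
    return total / len(ce)

# Hypothetical per-token values for a four-token sequence.
ce = [1.0, 3.0, 0.5, 2.0]
kl = [0.0, 0.5, 0.1, 1.0]
entropy = [0.2, 2.0, 0.3, 1.5]
mask = [False, True, False, True]

masked_loss = eksft_style_loss(ce, kl, entropy, mask)
plain_sft_loss = eksft_style_loss(ce, kl, entropy, [False] * 4)
print(masked_loss, plain_sft_loss)
```

With an all-False mask the function reduces to the mean cross-entropy of standard SFT, which makes it easy to compare the two settings on the same batch. In actual training the inputs would be differentiable tensors so that gradients flow through the KL and entropy terms.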
Topics
- Supervised Fine-Tuning
- Catastrophic Forgetting
- Kullback-Leibler Divergence
- Shannon Entropy
- Token Masking
- LLM Training
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.