No more Catastrophic Forgetting in SFT

2026-05-30 · Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Entropy Kullback-Leibler Divergence Based Token Masking (EKSFT) is a supervised fine-tuning (SFT) method designed to mitigate catastrophic forgetting and mode collapse in large language models. Developed at the University of Science and Knowledge of China, it identifies "viral" tokens: those with high Shannon entropy or high Kullback-Leibler (KL) divergence from a reference model. These high-risk tokens, which disproportionately cause parameter drift, are selectively masked. Concretely, the method takes the union of the high-entropy and high-KL token sets, masks the tokens in that union out of the standard cross-entropy loss, and adds KL divergence and entropy regularization terms. In experiments on a Qwen3 4-billion-parameter model, EKSFT improves on standard SFT by an average of 7% in Pass@1 and 5.1% in Pass@32. When combined with DPO reinforcement learning, it outperforms standard SFT by 5.6% on Pass@32, although some individual benchmark gains are below 1%.

Key takeaway

For machine learning engineers fine-tuning LLMs, EKSFT offers a way to mitigate catastrophic forgetting by selectively masking high-risk tokens. Weigh the modest reported gains (roughly 5 to 7% on Pass@K benchmarks, and under 1% on some individual benchmarks) against the extra compute: entropy and KL divergence must be calculated for every token, and the KL term requires the reference model's next-token distribution at each position. Whether that trade-off is worthwhile depends on your application and budget.

Key insights

EKSFT selectively masks high-entropy or high-KL-divergence tokens during SFT to mitigate catastrophic forgetting and preserve generalization.

Principles

High-entropy tokens drive parameter drift.
KL divergence quantifies model distribution shift.
Selective masking preserves core LLM knowledge.
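
The two quantities behind these principles can be computed per token position from next-token distributions. The sketch below is a minimal illustration in plain Python, not the authors' code. The function names (softmax, shannon_entropy, kl_divergence), the toy numbers, and the direction of the KL term (fine-tuned model relative to the reference) are assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(p):
    """H(p) = -sum_i p_i * log(p_i), in nats; zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats; q is floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
policy = softmax([2.0, 1.0, 0.5, 0.1])     # hypothetical model being fine-tuned
reference = softmax([2.0, 1.0, 0.5, 0.1])  # hypothetical frozen reference model
uniform = [0.25, 0.25, 0.25, 0.25]

print(shannon_entropy(uniform))          # the maximum for 4 outcomes, log(4)
print(shannon_entropy(policy))           # lower: the distribution is peaked
print(kl_divergence(policy, reference))  # identical distributions give 0
print(kl_divergence(policy, uniform))    # positive: the distributions differ
```

In a real fine-tuning run these values would be computed over the full vocabulary for every position in a batch, typically with a tensor library rather than Python lists.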

Method

EKSFT proceeds in four steps: it computes entropy and KL divergence for each token, selects the high-risk tokens by a top-K ratio, masks the union of the high-entropy and high-KL selections, and trains with a modified cross-entropy loss in which KL divergence and entropy regularization are applied to the masked tokens.
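
One way to write this down is sketched below. The summary does not give the exact formula, so this is a plausible form under stated assumptions rather than the authors' loss: H_t and D_t denote the per-token entropy and the KL divergence from the reference model, Top_ρ selects the highest-scoring fraction ρ of positions, and the weights β and γ (including their signs and values) are hypothetical.

```latex
\begin{aligned}
M_t &= \mathbb{1}\!\left[\, t \in \operatorname{Top}_\rho(H) \,\cup\, \operatorname{Top}_\rho(D) \,\right] \\
\mathcal{L} &= \frac{1}{T} \sum_{t=1}^{T}
  \Big[ (1 - M_t)\,\big(-\log \pi_\theta(y_t \mid x, y_{<t})\big)
        + M_t\,\big(\beta\, D_t + \gamma\, H_t\big) \Big]
\end{aligned}
```

Under this reading, unmasked positions (M_t = 0) contribute ordinary cross-entropy, while masked positions (M_t = 1) contribute only the regularization terms.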

In practice

Apply token-level entropy calculation.
Implement KL divergence regularization.
Combine masking with standard SFT.
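
These steps can be sketched in a few lines once per-token statistics are available. The example below is a minimal illustration in plain Python under several assumptions, not the published implementation: the helper names (top_ratio_indices, risk_mask, masked_sft_loss), the weights beta and gamma, the rounding rule for the top-K ratio, and the toy numbers are all hypothetical.

```python
import math

def top_ratio_indices(scores, ratio):
    """Index set of the highest-scoring `ratio` fraction of positions (at least one)."""
    k = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def risk_mask(entropies, kls, ratio):
    """Union of the top-entropy and top-KL positions; True means the token is masked."""
    risky = top_ratio_indices(entropies, ratio) | top_ratio_indices(kls, ratio)
    return [i in risky for i in range(len(entropies))]

def masked_sft_loss(nll, entropies, kls, mask, beta=0.1, gamma=0.01):
    """Mean per-token loss: cross-entropy on unmasked tokens, and (assumed)
    beta * KL + gamma * entropy regularization on masked tokens."""
    total = 0.0
    for n, h, d, m in zip(nll, entropies, kls, mask):
        total += (beta * d + gamma * h) if m else n
    return total / len(nll)

# Made-up per-token statistics for a 5-token target sequence.
nll = [0.2, 1.5, 0.1, 2.3, 0.4]          # negative log-likelihood of each target token
entropies = [0.3, 2.1, 0.2, 0.9, 0.5]    # entropy of the model's next-token distribution
kls = [0.01, 0.05, 0.02, 0.80, 0.03]     # KL divergence from the reference model

mask = risk_mask(entropies, kls, ratio=0.2)
loss = masked_sft_loss(nll, entropies, kls, mask)
print(mask)
print(loss)
```

With a ratio of 0.2 over five tokens, each measure contributes one position, so the union masks the highest-entropy token and the highest-KL token. In an actual training loop the same logic would operate on differentiable tensors so that gradients flow through the regularization terms.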

Topics

Supervised Fine-Tuning
Catastrophic Forgetting
Kullback-Leibler Divergence
Shannon Entropy
Token Masking
LLM Training

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.