No more Catastrophic Forgetting in SFT

· Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Entropy Kullback-Leibler Divergence Based Token Masking (EKSFT) is a supervised fine-tuning (SFT) method designed to mitigate catastrophic forgetting and mode collapse in large language models. Developed by the University of Science and Knowledge of China, EKSFT identifies "viral" tokens: those with high Shannon entropy or high Kullback-Leibler (KL) divergence from a reference model. These high-risk tokens, which disproportionately cause parameter drift, are selectively masked. The method modifies the standard cross-entropy loss in two ways: it masks the union of the high-entropy and high-KL token sets, and it adds KL divergence and entropy regularization terms. Experimental results on a Qwen3 4-billion-parameter model show an average improvement of 7% in Pass@1 and 5.1% in Pass@32 compared to standard SFT. When combined with DPO reinforcement learning, EKSFT outperforms standard SFT by 5.6% on Pass@32, though some individual benchmark improvements are less than 1%.
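To make the Pass@1 and Pass@32 figures concrete, here is a minimal sketch of the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The summary does not say which estimator the authors used, so treat this as an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples (c of them correct), is correct."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem solved in only 2 of 64 samples:
rare_pass_1 = pass_at_k(64, 2, 1)    # equals c / n
rare_pass_32 = pass_at_k(64, 2, 32)  # much higher, since rare solutions still count
```

In this example the same problem scores about 0.03 on Pass@1 but about 0.75 on Pass@32, which is why Pass@32 is sensitive to output diversity and is a natural place to look for the effects of mode collapse.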

Key takeaway

For machine learning engineers fine-tuning LLMs, EKSFT offers a way to mitigate catastrophic forgetting by selectively masking high-risk tokens. Before adopting it, evaluate whether the modest gains (roughly 5-7% on Pass@K benchmarks, and under 1% on some individual benchmarks) justify the extra compute required for per-token entropy and KL divergence calculations in your specific application.

Key insights

EKSFT selectively masks high-entropy or high-KL divergence tokens during SFT to mitigate catastrophic forgetting and preserve generalization.
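The selection step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: entropy is taken from the model being fine-tuned, KL is computed as KL(policy || reference), the same ratio is applied to both scores, and all function names and the default `ratio` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); q is clamped to eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def top_ratio_indices(scores, ratio):
    """Indices of the highest-scoring ceil(ratio * len(scores)) positions."""
    k = math.ceil(ratio * len(scores)) if ratio > 0 else 0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def risk_mask(policy_logits, reference_logits, ratio=0.2):
    """Flag a position as high-risk if it is in the top fraction by entropy
    OR by KL from the reference model (the union of the two sets)."""
    policy = [softmax(row) for row in policy_logits]
    reference = [softmax(row) for row in reference_logits]
    ent = [entropy(p) for p in policy]
    kl = [kl_divergence(p, q) for p, q in zip(policy, reference)]
    risky = top_ratio_indices(ent, ratio) | top_ratio_indices(kl, ratio)
    return [i in risky for i in range(len(policy))], ent, kl

# Toy sequence: 4 token positions over a 3-word vocabulary.
policy_logits = [[10, 0, 0], [0, 0, 0], [5, 0, 0], [0, 5, 0]]
reference_logits = [[10, 0, 0], [0, 0, 0], [5, 0, 0], [5, 0, 0]]
mask, ent, kl = risk_mask(policy_logits, reference_logits, ratio=0.25)
```

In the toy sequence, position 1 is flagged for its uniform (maximum-entropy) distribution and position 3 because the policy disagrees sharply with the reference, so the union picks up two tokens even though each criterion alone selects one.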

Principles

Method

EKSFT computes token-level entropy and KL divergence, identifies high-risk tokens using a top-K ratio, masks them, and applies KL divergence and entropy regularization to the masked tokens within a modified cross-entropy loss function.
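A toy version of such a modified loss might look like the sketch below. The summary does not give the exact formula, so several choices here are assumptions: cross-entropy is averaged over unmasked tokens only, the KL and entropy terms are averaged over masked tokens only, the entropy term enters with a negative sign (rewarding retained entropy), and the weights `lam_kl` and `lam_ent` are hypothetical.

```python
import math

def masked_sft_loss(policy_probs, reference_probs, targets, masked,
                    lam_kl=0.1, lam_ent=0.01):
    """Toy per-sequence loss: cross-entropy on kept tokens, regularizers on
    masked tokens. Probabilities are assumed strictly positive, as softmax
    outputs are."""
    ce_sum = kl_sum = ent_sum = 0.0
    n_kept = n_masked = 0
    for p, q, y, is_masked in zip(policy_probs, reference_probs, targets, masked):
        if is_masked:
            # High-risk token: no imitation signal, only regularization.
            kl_sum += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
            ent_sum += -sum(pi * math.log(pi) for pi in p)
            n_masked += 1
        else:
            # Ordinary token: standard negative log-likelihood of the target.
            ce_sum += -math.log(p[y])
            n_kept += 1
    ce = ce_sum / max(n_kept, 1)
    kl = kl_sum / max(n_masked, 1)
    ent = ent_sum / max(n_masked, 1)
    # ASSUMPTION: sign and weighting of the regularizers are illustrative.
    total = ce + lam_kl * kl - lam_ent * ent
    return {"ce": ce, "kl": kl, "entropy": ent, "total": total}

policy = [[0.5, 0.5], [0.9, 0.1]]
reference = [[0.5, 0.5], [0.5, 0.5]]
with_mask = masked_sft_loss(policy, reference, targets=[0, 0], masked=[False, True])
no_mask = masked_sft_loss(policy, reference, targets=[0, 0], masked=[False, False])
```

With no tokens masked, both regularizers vanish and the function reduces to ordinary mean cross-entropy, which makes it easy to compare against a standard SFT baseline. A real training loop would compute the same quantities on batched logits with an autodiff framework.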

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.