Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates
Summary
Researchers from the University of Sheffield, Hitachi, Ltd., and the University of Exeter introduce Source-Shielded Updates (SSU), a selective parameter update strategy that mitigates catastrophic forgetting when instruct Large Language Models (LLMs) are adapted to new target languages using only unlabeled data. The method preserves source knowledge by identifying the parameters most critical to the model's original capabilities and freezing them before adaptation begins. In experiments on 7B and 13B OLMo 2 Instruct models across five typologically diverse languages, SSU limited average performance degradation on monolingual source tasks to 3.4% (7B) and 2.8% (13B), compared with 20.3% and 22.3% under full fine-tuning. SSU also achieved target-language performance competitive with, and often superior to, full fine-tuning across various benchmarks.
Key takeaway
For AI engineers and research scientists working on multilingual LLM deployment, SSU offers a way to add target languages without sacrificing core model capabilities. Shielding source knowledge before adaptation yields strong target-language performance while minimizing catastrophic forgetting, which is crucial for keeping instruct models general-purpose. Consider integrating SSU into your adaptation pipeline, especially when instruction-tuning data for the target language is scarce or costly.
Key insights
Source-Shielded Updates (SSU) proactively freezes critical parameters to prevent catastrophic forgetting during LLM language adaptation.
Principles
- Proactive preservation of source knowledge is key.
- Column-wise freezing maintains feature transformations.
- Source-data-driven scoring identifies critical parameters.
Method
SSU involves three stages: scoring parameter importance on source data (e.g., with Wanda), generating a column-wise freezing mask from those scores, and applying the mask during continual pre-training on unlabeled target-language data.
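The first two stages can be illustrated with a small sketch. It is not the authors' implementation: it assumes the Wanda score in its published form (absolute weight times the L2 norm of the matching input feature over calibration data), treats a weight matrix as rows of output units and columns of input features, and aggregates scores per column by summation. The function names, the aggregation rule, and the `freeze_ratio` value are all hypothetical choices made for illustration.

```python
import math

def wanda_scores(weight, activations):
    """Wanda-style importance: |W[i][j]| * L2 norm of input feature j.

    weight:      list of rows, shape (out_features, in_features)
    activations: source-data calibration inputs, shape (tokens, in_features)
    """
    n_in = len(weight[0])
    # L2 norm of each input feature across all calibration tokens
    norms = [math.sqrt(sum(tok[j] ** 2 for tok in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weight]

def column_freeze_mask(scores, freeze_ratio):
    """Return one flag per column: True = trainable, False = frozen.

    Assumption: a column's importance is the sum of its element scores,
    and the top `freeze_ratio` fraction of columns is frozen.
    """
    n_in = len(scores[0])
    col_scores = [sum(row[j] for row in scores) for j in range(n_in)]
    n_freeze = int(round(freeze_ratio * n_in))
    ranked = sorted(range(n_in), key=lambda j: col_scores[j], reverse=True)
    frozen = set(ranked[:n_freeze])
    return [j not in frozen for j in range(n_in)]

# Toy layer: 2 output units, 4 input features.
weight = [[0.5, -2.0, 0.1, 0.3],
          [1.0,  0.5, -0.2, 0.4]]
# Two calibration tokens from source-language data.
activations = [[1.0, 0.0, 3.0, 0.1],
               [1.0, 0.0, 4.0, 0.1]]

mask = column_freeze_mask(wanda_scores(weight, activations), freeze_ratio=0.5)
print(mask)  # [False, True, False, True]
```

In this toy example column 1 holds the largest raw weight (-2.0) yet stays trainable, because its input feature is never active on the source data. That is the point of scoring with source activations instead of weight magnitude alone.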
In practice
- Adapt instruct LLMs with unlabeled target language data.
- Use small source data samples for parameter calibration.
- Apply column-wise freezing to preserve model structure.
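The last practice, applying a column-wise freezing mask during training, can be sketched as a masked update step. This is a minimal illustration that uses plain SGD on nested lists; the function name `masked_sgd_step` and the toy values are hypothetical, and a real adaptation run would use a deep learning framework and its own optimizer. With optimizers that apply decoupled weight decay, zeroing gradients alone would not be enough, since decay would still shrink the frozen columns.

```python
def masked_sgd_step(weight, grad, trainable_cols, lr):
    """One SGD step that leaves frozen columns untouched.

    weight, grad:   lists of rows, shape (out_features, in_features)
    trainable_cols: one bool per column; False means the column is frozen
    """
    return [
        [w - lr * g if trainable_cols[j] else w
         for j, (w, g) in enumerate(zip(w_row, g_row))]
        for w_row, g_row in zip(weight, grad)
    ]

weight = [[1.0, 2.0],
          [3.0, 4.0]]
grad = [[1.0, 1.0],
        [2.0, 2.0]]
trainable_cols = [False, True]  # column 0 is shielded, column 1 adapts

updated = masked_sgd_step(weight, grad, trainable_cols, lr=0.5)
print(updated)  # [[1.0, 1.5], [3.0, 3.0]]
```

Column 0 keeps its original values across the step, while column 1 moves along the gradient computed from the target-language batch.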
Topics
- Catastrophic Forgetting
- Large Language Models
- Source-Shielded Updates
- Target Language Adaptation
- Selective Parameter Updates
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.