From Weights to Features: SAE-Guided Activation Regularization for LLM Continual Learning
Summary
SAE-Guided Activation Regularization is a new method for reducing catastrophic forgetting in large language models (LLMs) during continual learning. Traditional weight-space regularization, such as Elastic Weight Consolidation (EWC), underperforms in LLMs because their weights are polysemantic, which makes per-weight importance estimates too coarse to isolate task knowledge. The new approach regularizes in the model's activation space instead, using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary. The authors derive a loss function that balances stability and plasticity and show that EWC is a special case of it. Unlike replay-based methods, the approach needs no previous-task data once the mask has been constructed; only a compact SAE feature mask is retained. It is also more memory-efficient than weight-space methods because the feature space is lower-dimensional. Among approaches without task-specific architectural components, it achieves the strongest results on the TRACE and MedCL continual learning benchmarks, outperforming traditional weight-space methods. Empirical evidence supports the polysemanticity thesis: task-relevant representations are linearly separable in the SAE feature basis but not in the weight basis.
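For orientation, the standard EWC objective is shown below next to one plausible shape for an activation-space counterpart. The first equation is the usual EWC penalty with diagonal Fisher information $F_i$ and anchor weights $\theta^*$. The second is only an illustrative assumption, not the loss derived in the original article: the mask $m_j$, the SAE feature map $f_j$, and the choice of a squared difference are hypothetical notation introduced here.

```latex
% Standard EWC: penalize movement of each weight, scaled by its Fisher importance.
\mathcal{L}_{\mathrm{EWC}}(\theta)
  = \mathcal{L}_{\mathrm{new}}(\theta)
  + \frac{\lambda}{2} \sum_i F_i \,\bigl(\theta_i - \theta_i^{*}\bigr)^2

% Assumed activation-space analogue (illustrative only):
% penalize drift of masked SAE features f_j between the old and updated model.
\mathcal{L}_{\mathrm{feat}}(\theta)
  = \mathcal{L}_{\mathrm{new}}(\theta)
  + \lambda \,\mathbb{E}_{x} \sum_j m_j \,\bigl(f_j(x;\theta) - f_j(x;\theta^{*})\bigr)^2
```

In both forms, $\lambda$ trades stability (a large penalty) against plasticity (a small penalty). The difference is the unit being protected: individual weights in the first, SAE features in the second.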
Key takeaway
Machine learning engineers developing continual learning strategies for large language models will often find traditional weight-space regularization methods such as EWC insufficient. SAE-Guided Activation Regularization is worth considering as an alternative: it uses a monosemantic feature dictionary to limit catastrophic forgetting with less memory, and it stores no previous-task data. It reports the strongest results on TRACE and MedCL among approaches without task-specific architectural components, which makes it a strong candidate for adapting LLMs to new tasks without losing prior knowledge.
Key insights
SAE-guided activation regularization reduces catastrophic forgetting in LLMs by regularizing monosemantic SAE features rather than polysemantic weights.
Principles
- LLM weights are polysemantic, hindering effective knowledge isolation.
- Regularizing in activation space with monosemantic features improves continual learning.
- Feature space regularization can balance stability and plasticity.
Method
Derive a loss function for activation-space regularization. Use pretrained SAEs to build a compact feature mask from the current task's data, then retain only that mask, not the data, when training on subsequent tasks.
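The mask-then-regularize idea can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the article's implementation: the ReLU encoder form, the top-k selection rule, the squared-drift penalty, and every function name (`sae_encode`, `build_feature_mask`, `masked_feature_penalty`) are hypothetical. The sketch also assumes the penalty compares activations from the updated model against those from a frozen reference model on the same inputs.

```python
def sae_encode(activation, enc_weights, enc_bias):
    """Hypothetical SAE encoder: features = ReLU(W_enc @ activation + b_enc)."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]


def build_feature_mask(task_activations, enc_weights, enc_bias, top_k):
    """Mark the top_k SAE features with the highest mean activation on task data.

    The selection rule is an assumption made for illustration only.
    """
    n_features = len(enc_weights)
    totals = [0.0] * n_features
    for act in task_activations:
        for j, f in enumerate(sae_encode(act, enc_weights, enc_bias)):
            totals[j] += f
    means = [t / len(task_activations) for t in totals]
    ranked = sorted(range(n_features), key=lambda j: means[j], reverse=True)
    keep = set(ranked[:top_k])
    return [1 if j in keep else 0 for j in range(n_features)]


def masked_feature_penalty(new_act, ref_act, mask, enc_weights, enc_bias):
    """Squared drift of the masked SAE features between two activations."""
    f_new = sae_encode(new_act, enc_weights, enc_bias)
    f_ref = sae_encode(ref_act, enc_weights, enc_bias)
    return sum(m * (a - b) ** 2 for m, a, b in zip(mask, f_new, f_ref))


# Toy SAE: 2-dimensional activations, 4 features (+x0, +x1, -x0, -x1).
enc_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
enc_bias = [0.0, 0.0, 0.0, 0.0]

# Activations collected on the current task; feature 0 dominates.
task_activations = [[1.0, 0.1], [2.0, 0.0], [1.5, 0.2]]
mask = build_feature_mask(task_activations, enc_weights, enc_bias, top_k=1)

ref = [1.0, 0.5]
# Drift along the unprotected feature 1 costs nothing.
free_drift = masked_feature_penalty([1.0, 3.0], ref, mask, enc_weights, enc_bias)
# Drift along the protected feature 0 is penalized.
protected_drift = masked_feature_penalty([2.0, 0.5], ref, mask, enc_weights, enc_bias)
print(mask, free_drift, protected_drift)
```

After the mask is built, `task_activations` can be discarded; only `mask` is carried into later training, which mirrors the "retain only the mask" step described above. In a real training loop the penalty would be scaled by a coefficient and added to the new-task loss.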
In practice
- Apply SAEs to create feature masks for continual learning in LLMs.
- Evaluate continual learning methods on TRACE and MedCL benchmarks.
- Consider activation-space regularization for memory-efficient LLM adaptation.
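The memory-efficiency point can be made concrete with back-of-envelope arithmetic. The sketch below compares the state an EWC-style method keeps between tasks (a diagonal Fisher estimate plus anchor weights, so two values per parameter) with a one-bit-per-feature mask. The model size, layer count, and SAE width are hypothetical values chosen for illustration and do not come from the original article. The pretrained SAE itself, and any reference model a particular implementation might keep, are not counted.

```python
def ewc_state_bytes(n_params, bytes_per_value=4):
    """Diagonal Fisher plus anchor weights: two stored values per parameter."""
    return 2 * n_params * bytes_per_value


def feature_mask_bytes(n_layers, features_per_layer):
    """One bit per SAE feature per layer, rounded up to whole bytes."""
    return n_layers * ((features_per_layer + 7) // 8)


# Hypothetical sizes, for illustration only.
n_params = 7_000_000_000
n_layers = 32
features_per_layer = 65_536

ewc = ewc_state_bytes(n_params)
mask = feature_mask_bytes(n_layers, features_per_layer)

print(f"EWC state:    {ewc / 1e9:.1f} GB")
print(f"Feature mask: {mask / 1024:.0f} KiB")
```

Under these assumed sizes the per-task state shrinks from tens of gigabytes to a few hundred kibibytes. The exact ratio depends on the model, the number of SAE layers used, and how the mask is encoded.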
Topics
- Continual Learning
- Large Language Models
- Sparse Autoencoders
- Catastrophic Forgetting
- Activation Regularization
- Polysemanticity
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.