From Weights to Features: SAE-Guided Activation Regularization for LLM Continual Learning

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

SAE-Guided Activation Regularization is a method for mitigating catastrophic forgetting in large language models (LLMs) during continual learning. Weight-space regularizers such as Elastic Weight Consolidation (EWC) underperform on LLMs because the models are polysemantic: individual weights take part in encoding many unrelated features, so per-weight importance estimates are too coarse. The method instead regularizes in the model's activation space, using pretrained Sparse Autoencoders (SAEs) as a monosemantic feature dictionary. The authors derive a loss function that balances stability and plasticity and show that EWC is a special case of it. Unlike replay-based methods, it requires no previous-task data after an initial mask construction; only a compact SAE feature mask is retained. It is also more memory-efficient because the feature space is lower-dimensional. On the TRACE and MedCL continual learning benchmarks, it achieves the strongest results among approaches without task-specific architectural components and outperforms traditional weight-space methods. Empirical evidence supports the polysemanticity thesis: task-relevant representations are linearly separable in the SAE feature basis but not in the weight basis.
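For reference, the weight-space baseline named in the summary can be sketched in a few lines. This is the standard textbook form of the EWC penalty, not code from the source article; the function and argument names (`ewc_penalty`, `theta`, `theta_star`, `fisher`, `lam`) are hypothetical, and the diagonal Fisher values are assumed to have been estimated elsewhere.

```python
def ewc_penalty(theta, theta_star, fisher, lam):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta      -- current weights (flat list of floats)
    theta_star -- weights saved after the previous task
    fisher     -- diagonal Fisher information, one importance value per weight
    lam        -- regularization strength (assumed hyperparameter)
    """
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star)
    )


# Only the first weight moved, so only its Fisher value contributes.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], lam=0.5)
print(penalty)  # 1.0
```

Each term here is tied to a single weight, which is exactly the granularity the summary describes as too coarse for polysemantic models.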

Key takeaway

For machine learning engineers building continual learning pipelines for large language models, weight-space regularization methods such as EWC are often insufficient. Consider SAE-Guided Activation Regularization, which uses a monosemantic feature dictionary to limit catastrophic forgetting more effectively and with less memory. It requires no previous-task data after the initial mask construction, and it reports the strongest results on TRACE and MedCL among approaches without task-specific architectural components, which makes it a candidate for adapting LLMs to new tasks without losing prior knowledge.

Key insights

SAE-guided activation regularization counters catastrophic forgetting in LLMs by regularizing monosemantic SAE features instead of polysemantic weights.

Principles

Method

Derive a loss function for activation-space regularization that uses pretrained SAEs as the feature basis. Build a compact SAE feature mask from each task's data while that task is current, then retain only the mask, not the data, for subsequent training.
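The two steps in the method can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration: the source does not give the loss or the mask rule, so the ReLU encoder, the top-k mean-activation mask, the squared-drift penalty, and all names (`sae_encode`, `build_feature_mask`, `activation_penalty`, `top_k`) are hypothetical. The sketch also assumes the reference activations come from a frozen copy of the model evaluated on current-task inputs, so no earlier task data is needed.

```python
# Illustrative sketch only: encoder, mask rule, and penalty form are
# assumptions, not the formulation from the source paper.

def sae_encode(h, W_enc, b_enc):
    """Toy SAE encoder: f = relu(W_enc @ h + b_enc)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(W_enc, b_enc)
    ]


def build_feature_mask(activations, W_enc, b_enc, top_k):
    """Mark the top_k SAE features with the highest mean activation on task data."""
    n_features = len(W_enc)
    totals = [0.0] * n_features
    for h in activations:
        for i, f in enumerate(sae_encode(h, W_enc, b_enc)):
            totals[i] += f
    means = [t / len(activations) for t in totals]
    keep = sorted(range(n_features), key=lambda i: means[i], reverse=True)[:top_k]
    return [1.0 if i in keep else 0.0 for i in range(n_features)]


def activation_penalty(h_new, h_ref, mask, W_enc, b_enc):
    """Squared drift of the masked SAE features between two activation vectors."""
    f_new = sae_encode(h_new, W_enc, b_enc)
    f_ref = sae_encode(h_ref, W_enc, b_enc)
    return sum(m * (a - b) ** 2 for m, a, b in zip(mask, f_new, f_ref))


# Toy SAE with 3 features over 2-dimensional activations.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -1.0]

# Step 1: build the mask from the task's activations, then discard the data.
task_activations = [[2.0, 0.0], [1.0, 0.0]]
mask = build_feature_mask(task_activations, W_enc, b_enc, top_k=1)
print(mask)  # [1.0, 0.0, 0.0]

# Step 2: during later training, penalize drift only in the masked features.
h_ref = [1.0, 0.0]  # activation from the frozen reference model
print(activation_penalty([1.0, 3.0], h_ref, mask, W_enc, b_enc))  # 0.0
print(activation_penalty([2.0, 0.0], h_ref, mask, W_enc, b_enc))  # 1.0
```

In this toy setup, drift confined to unmasked features costs nothing, which is how a mask of this kind would leave room for plasticity while protecting the features an earlier task relied on.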

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.