From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning
Summary
This paper introduces a spectral theory to characterize forgetting in continual learning within an exact-fit overparameterized linear regression model. Shifting focus from task ordering to task distribution, the authors analyze how the generating distribution $\Pi$ governs forgetting when tasks are sampled i.i.d. They derive an exact operator identity for the forgetting quantity, $F^{\Pi}(k)$, revealing a recursive spectral structure. This framework yields an unconditional exponential upper bound, an explicit characterization of the leading asymptotic term, and sharp convergence rates up to constants in generic nondegenerate cases. The study further relates the rate-controlling quantity $\rho_{\Pi}$ to geometric properties of the task distribution, explaining why slow forgetting occurs when error directions remain weakly visible across tasks. Experimental results in synthetic linear settings with ambient dimension $d=192$ and task rank $r=48$ validate the theory, showing that the explicit bound tracks empirical forgetting and the analytic $\rho_{\Pi}$ matches observed decay speeds.
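The setting can be made concrete with a small simulation. This is a minimal sketch, not the authors' experiment: it assumes the standard exact-fit update, in which the learner moves to the nearest solution that interpolates the new task, so the parameter error obeys $e \leftarrow (I - P)e$ with $P$ the projector onto the task's row space. It also assumes isotropic Gaussian tasks, which the summary does not specify. It tracks squared parameter error as a simple proxy rather than the paper's $F^{\Pi}(k)$. Under these assumptions the expected squared error contracts by $1 - r/d$ per task. The function names are illustrative.

```python
import numpy as np

def random_basis(d, r, rng):
    """Orthonormal basis (d x r) for the row space of a random rank-r Gaussian task."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

def simulate_error(d=192, r=48, steps=15, trials=100, seed=0):
    """Mean squared parameter error after each exact-fit update, averaged over trials."""
    rng = np.random.default_rng(seed)
    errs = np.zeros(steps + 1)
    for _ in range(trials):
        e = rng.standard_normal(d)
        e /= np.linalg.norm(e)            # unit-norm initial error w_0 - w*
        errs[0] += e @ e
        for k in range(1, steps + 1):
            q = random_basis(d, r, rng)   # task sampled i.i.d. (assumed isotropic)
            e = e - q @ (q.T @ e)         # exact fit: e <- (I - P) e
            errs[k] += e @ e
    return errs / trials

errs = simulate_error()
# For isotropic tasks E[P] = (r/d) I, so the expected squared error is (1 - r/d)^k.
predicted = (1 - 48 / 192) ** np.arange(len(errs))
for k in (0, 5, 10, 15):
    print(k, errs[k], predicted[k])
```

With $d=192$ and $r=48$ the per-task contraction factor in this isotropic sketch is $0.75$; non-isotropic task distributions would generally decay at a different rate.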
Key takeaway
For research scientists developing continual learning systems, understanding the spectral properties of the task distribution is crucial. In this setting, a model's forgetting rate is tied to the geometry of the task distribution through the rate-controlling quantity $\rho_{\Pi}$. Rather than focusing solely on task ordering strategies, prioritize rich and diverse task families in which error directions are adequately visible across tasks, so that forgetting decays faster.
Key insights
Task distribution geometry, not just task order, fundamentally governs forgetting rates in continual learning.
Principles
- Forgetting loss decays exponentially at spectral rate $\rho_{\Pi}$ in generic cases.
- Slow forgetting correlates with error directions weakly visible across tasks.
- Commuting projector families can exhibit zero actual forgetting.
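The last principle can be checked numerically. The sketch below assumes the same exact-fit update $e \leftarrow (I - P)e$ and uses the mean squared residual $\|P e\|^2$ over the tasks already seen as a forgetting proxy. That proxy is an assumption of this example, not necessarily the paper's $F^{\Pi}(k)$. Coordinate projectors stand in for a commuting family because diagonal matrices always commute; if $P_i$ and $P_j$ commute and $P_i e = 0$, then $P_i (I - P_j) e = (I - P_j) P_i e = 0$, so an earlier task stays fitted.

```python
import numpy as np

def run_sequence(projectors, e0):
    """Apply exact-fit updates in order, then return the mean residual on seen tasks."""
    e = e0.copy()
    for P in projectors:
        e = e - P @ e                     # e <- (I - P) e
    # Forgetting proxy: mean squared residual ||P e||^2 over all tasks seen so far.
    return float(np.mean([np.sum((P @ e) ** 2) for P in projectors]))

rng = np.random.default_rng(0)
d, r, n_tasks = 12, 4, 5
e0 = rng.standard_normal(d)

# Commuting family: projectors onto random subsets of coordinates (diagonal matrices).
commuting = []
for _ in range(n_tasks):
    mask = np.zeros(d)
    mask[rng.choice(d, size=r, replace=False)] = 1.0
    commuting.append(np.diag(mask))

# Generic family: projectors onto random r-dimensional subspaces (non-commuting in general).
generic = []
for _ in range(n_tasks):
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    generic.append(q @ q.T)

forget_commuting = run_sequence(commuting, e0)
forget_generic = run_sequence(generic, e0)
print("commuting:", forget_commuting)
print("generic:  ", forget_generic)
```

In the commuting case every earlier task remains exactly fitted, while the generic family leaves a nonzero residual on earlier tasks.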
Method
The authors derive an exact operator identity for the forgetting quantity $F^{\Pi}(k)$, then apply a spectral expansion to characterize its decay scales and activation coefficients, and finally relate $\rho_{\Pi}$ to the geometry of the task distribution.
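To make the shape of such an identity concrete, here is a sketch under the standard exact-fit assumptions, which this summary does not spell out: each task $t$ has an orthogonal projector $P_t$ onto the row space of its data, the learner moves to the nearest interpolating solution, and $e_k = w_k - w^{\star}$ is the parameter error after $k$ tasks. The operator $\mathcal{T}$ is notation introduced here for illustration; the paper's exact definitions of $F^{\Pi}(k)$ and $\rho_{\Pi}$ may differ.

```latex
\begin{align*}
e_k &= (I - P_{t_k})\, e_{k-1}, \qquad t_k \overset{\text{i.i.d.}}{\sim} \Pi, \\
\mathcal{T}(M) &= \mathbb{E}_{t \sim \Pi}\big[(I - P_t)\, M\, (I - P_t)\big], \\
\mathbb{E}\big[e_k e_k^{\top}\big] &= \mathcal{T}^{k}\big(e_0 e_0^{\top}\big), \\
\mathbb{E}\,\|e_k\|^2 &= \operatorname{tr}\,\mathcal{T}^{k}\big(e_0 e_0^{\top}\big).
\end{align*}
```

Since $\mathcal{T}$ is self-adjoint under the trace inner product, expanding it in its eigenbasis turns quantities such as $\mathbb{E}\,\|e_k\|^2$ into sums of geometric terms, each with its own decay scale and coefficient. A spectral expansion describes this kind of structure.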
In practice
- Increase task diversity to reduce $\rho_{\Pi}$ and accelerate forgetting decay.
- Ensure tasks probe complementary error directions, so that no direction stays weakly visible across tasks.
- Consider task richness relative to model dimension (the experiments use task rank $r=48$ in ambient dimension $d=192$).
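One rough way to compare candidate task families before training is to estimate how visible the least-visible direction is. The sketch below is an illustration, not the paper's procedure. It estimates the mean projector $\mathbb{E}[P]$ by Monte Carlo and reports $1 - \lambda_{\min}(\mathbb{E}[P])$. Under the exact-fit update with i.i.d. tasks, this value bounds the per-task contraction of the expected squared parameter error. It is a proxy chosen for this example and should not be read as the paper's $\rho_{\Pi}$. The `narrow` sampler is a hypothetical low-diversity family in which half of the coordinates are rarely excited.

```python
import numpy as np

d, r = 24, 6
rng = np.random.default_rng(0)

def isotropic(rng):
    """Diverse family: row spaces of isotropic Gaussian tasks."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return q

def narrow(rng, scale=0.1):
    """Hypothetical low-diversity family: the last d/2 coordinates are weakly excited."""
    x = rng.standard_normal((d, r))
    x[d // 2:] *= scale
    q, _ = np.linalg.qr(x)
    return q

def mean_projector(sample_basis, n_samples=2000):
    """Monte Carlo estimate of E[P], where P = Q Q^T for a sampled orthonormal basis Q."""
    acc = np.zeros((d, d))
    for _ in range(n_samples):
        q = sample_basis(rng)
        acc += q @ q.T
    return acc / n_samples

def visibility_rate(sample_basis):
    """1 - lambda_min(E[P]): closer to 1 means some direction is rarely seen."""
    lam_min = np.linalg.eigvalsh(mean_projector(sample_basis))[0]
    return 1.0 - lam_min

rate_isotropic = visibility_rate(isotropic)
rate_narrow = visibility_rate(narrow)
print("isotropic:", rate_isotropic)
print("narrow:   ", rate_narrow)
```

A value near 1 for the narrow family signals that some error direction is almost never probed, matching the slow-decay regime described above.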
Topics
- Continual Learning
- Forgetting Characterization
- Spectral Theory
- Overparameterized Linear Regression
- Task Distribution Geometry
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.