From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

This paper introduces a spectral theory to characterize forgetting in continual learning within an exact-fit overparameterized linear regression model. Shifting focus from task ordering to task distribution, the authors analyze how the generating distribution $\Pi$ governs forgetting when tasks are sampled i.i.d. They derive an exact operator identity for the forgetting quantity, $F^{\Pi}(k)$, revealing a recursive spectral structure. This framework yields an unconditional exponential upper bound, an explicit characterization of the leading asymptotic term, and sharp convergence rates up to constants in generic nondegenerate cases. The study further relates the rate-controlling quantity $\rho_{\Pi}$ to geometric properties of the task distribution, explaining why slow forgetting occurs when error directions remain weakly visible across tasks. Experimental results in synthetic linear settings with ambient dimension $d=192$ and task rank $r=48$ validate the theory, showing that the explicit bound tracks empirical forgetting and the analytic $\rho_{\Pi}$ matches observed decay speeds.

Key takeaway

For research scientists building continual learning systems, the spectral properties of the task distribution matter. In the paper's exact-fit linear setting, forgetting rates are tied to the geometry of the task distribution through the rate-controlling quantity $\rho_{\Pi}$. Prioritize task families that are rich and diverse enough to keep error directions visible across tasks, since that is what yields faster forgetting decay; task ordering strategies alone do not address it.

Key insights

Task distribution geometry, not just task order, fundamentally governs forgetting rates in continual learning.

Principles

Forgetting loss decays exponentially at spectral rate $\rho_{\Pi}$ in generic cases.
Slow forgetting correlates with error directions weakly visible across tasks.
Commuting projector families can exhibit zero actual forgetting.
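
The commuting case can be checked numerically in a toy setting. The sketch below is an illustration, not the paper's code. It assumes noiseless labels $y = Xw^*$, training from $w = 0$, and an update that moves the weights to the nearest exact-fit solution of the current task; none of these details are spelled out in this summary. The helper names (`fit_task`, `first_task_loss_after_sequence`, `demo`) are hypothetical.

```python
import numpy as np

def fit_task(w, X, y):
    """Exact fit with minimal change: project w onto {v : X v = y}."""
    return w + np.linalg.pinv(X) @ (y - X @ w)

def first_task_loss_after_sequence(row_matrices, w_star):
    """Fit tasks in order from w = 0; return the final loss on the first task."""
    w = np.zeros_like(w_star)
    for X in row_matrices:
        w = fit_task(w, X, X @ w_star)  # assumption: noiseless labels y = X w_star
    X1 = row_matrices[0]
    return float(np.mean((X1 @ (w - w_star)) ** 2))

def demo(d=6, seed=0):
    """Return (loss with commuting projectors, loss with generic projectors)."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    eye = np.eye(d)
    # Rows are standard basis vectors, so the row-space projectors are
    # diagonal matrices and therefore commute.
    commuting = [eye[[0, 1, 2]], eye[[2, 3]], eye[[1, 4]]]
    # Generic Gaussian rows: the projectors almost surely do not commute.
    generic = [rng.standard_normal((m, d)) for m in (3, 2, 2)]
    return (first_task_loss_after_sequence(commuting, w_star),
            first_task_loss_after_sequence(generic, w_star))

if __name__ == "__main__":
    loss_commuting, loss_generic = demo()
    print("first-task loss, commuting projectors:", loss_commuting)
    print("first-task loss, generic projectors:  ", loss_generic)
```

With commuting projectors, fitting a later task cannot reintroduce error inside an earlier task's row space, so the first-task loss stays at zero up to floating-point error. With generic projectors it is typically nonzero.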

Method

The authors first derive an exact operator identity for the forgetting quantity $F^{\Pi}(k)$. They then expand it spectrally to characterize the decay scales and their activation coefficients, and relate $\rho_{\Pi}$ to the geometry of the task distribution.
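
A small simulation can make the setting concrete. The following is a minimal sketch under stated assumptions, not the authors' experimental code. It assumes each task is a uniformly random rank-$r$ row space in dimension $d$, that training on a task projects the current error off that row space (the exact-fit, minimal-change update), and that forgetting is the average loss over all tasks seen so far. The summary does not give the paper's precise definition of $F^{\Pi}(k)$, so `forget` here is a stand-in, and all function names are hypothetical.

```python
import numpy as np

def sample_task_basis(rng, d, r):
    """Orthonormal basis (d x r) of a uniformly random r-dimensional row space."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    return Q

def run_sequence(rng, d, r, k):
    """Return (squared error norm, average loss on seen tasks) after each step."""
    e = rng.standard_normal(d)
    e /= np.linalg.norm(e)          # e = w_0 - w_star, normalized to unit length
    bases, err, forget = [], [], []
    for _ in range(k):
        Q = sample_task_basis(rng, d, r)
        e = e - Q @ (Q.T @ e)       # exact fit: remove the error inside the row space
        bases.append(Q)
        err.append(e @ e)
        # Assumed forgetting measure: mean over seen tasks of ||Q_m^T e||^2.
        forget.append(np.mean([np.sum((B.T @ e) ** 2) for B in bases]))
    return np.array(err), np.array(forget)

def average_runs(d=192, r=48, k=30, n_runs=50, seed=0):
    """Average both curves over independent task sequences."""
    rng = np.random.default_rng(seed)
    runs = [run_sequence(rng, d, r, k) for _ in range(n_runs)]
    err = np.mean([a for a, _ in runs], axis=0)
    forget = np.mean([b for _, b in runs], axis=0)
    return err, forget

def fitted_rate(curve):
    """Per-step decay factor from a least-squares line through log(curve)."""
    steps = np.arange(1, len(curve) + 1)
    return float(np.exp(np.polyfit(steps, np.log(curve), 1)[0]))

if __name__ == "__main__":
    err, forget = average_runs()
    print("fitted per-step decay of squared error:", fitted_rate(err))
    print("isotropic prediction 1 - r/d:          ", 1 - 48 / 192)
```

In this isotropic toy case the expected squared error contracts by exactly $1 - r/d$ per step, because a uniformly random rank-$r$ projector has mean $(r/d)I$. The fitted rate should therefore land near 0.75 for $d=192$, $r=48$. Whether that factor coincides with the paper's $\rho_{\Pi}$ is not stated in this summary.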

In practice

Increase task diversity to reduce $\rho_{\Pi}$ and accelerate forgetting decay.
Ensure tasks probe complementary error directions for faster learning.
Consider task richness (for example, task rank $r$) relative to model dimension $d$ when designing task families.
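
One way to see why diversity matters is to look at the mean row-space projector $E[P]$ of a task distribution. For an independent task, the expected squared error after one exact-fit step is $e^\top (I - E[P])\, e$, so $1 - \lambda_{\min}(E[P])$ bounds the per-step contraction in the worst direction. The sketch below estimates this proxy by Monte Carlo. It is an illustration under assumptions: Gaussian task rows with a diagonal covariance are a made-up task family, the function names are hypothetical, and the proxy is not claimed to equal the paper's $\rho_{\Pi}$.

```python
import numpy as np

def mean_projector(rng, scales, r, n_samples):
    """Monte Carlo estimate of E[P], with P the projector onto a task's row space."""
    d = len(scales)
    acc = np.zeros((d, d))
    for _ in range(n_samples):
        X = rng.standard_normal((r, d)) * scales  # rows ~ N(0, diag(scales**2))
        Q, _ = np.linalg.qr(X.T)                  # orthonormal basis of the row space
        acc += Q @ Q.T
    return acc / n_samples

def one_step_contraction(scales, r=6, n_samples=2000, seed=0):
    """Estimate 1 - lambda_min(E[P]): worst-case factor for E||e_next||^2 / ||e||^2."""
    rng = np.random.default_rng(seed)
    P_bar = mean_projector(rng, np.asarray(scales, dtype=float), r, n_samples)
    return float(1.0 - np.linalg.eigvalsh(P_bar)[0])  # eigvalsh sorts ascending

d = 24
diverse = np.ones(d)   # every direction is equally visible to the tasks
narrow = np.ones(d)
narrow[-4:] = 0.1      # four directions are only weakly visible

if __name__ == "__main__":
    print("diverse tasks:", one_step_contraction(diverse))
    print("narrow tasks: ", one_step_contraction(narrow))
```

The narrow family should give a contraction factor much closer to 1 than the diverse family: error along the weakly visible directions is rarely projected away, so it decays slowly. The Monte Carlo estimate of the smallest eigenvalue is biased slightly downward, so the diverse case will read a little above $1 - r/d = 0.75$.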

Topics

Continual Learning
Forgetting Characterization
Spectral Theory
Overparameterized Linear Regression
Task Distribution Geometry

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.