Mitigating Forgetting in Continual Learning with Selective Gradient Projection

2026-04-01 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Algoverse AI Research introduces Selective Forgetting-Aware Optimization (SFAO), a dynamic method for mitigating catastrophic forgetting in continual learning. SFAO regulates gradient directions using cosine similarity and a per-layer gating mechanism, balancing model plasticity against stability. Using tunable thresholds, it selectively projects, accepts, or discards each update, and it relies on an efficient Monte Carlo approximation to keep the alignment check cheap. Experiments on standard continual learning benchmarks, including MNIST and CIFAR datasets, show that SFAO achieves competitive accuracy while reducing memory cost by 90% and improving forgetting metrics. This makes SFAO well suited to resource-constrained scenarios, and it generalizes more readily than regularization-based methods, which often need larger architectures such as Wide ResNet-28x10 to remain stable.

Key takeaway

For research scientists developing continual learning models, SFAO offers a robust, memory-efficient alternative to traditional regularization and to orthogonal gradient descent (OGD). Consider integrating SFAO's similarity-gated update rule, especially in resource-constrained environments or when architectural flexibility is critical: it performs consistently across diverse model capacities and significantly reduces memory overhead compared to OGD.

Key insights

SFAO uses similarity-gated gradient updates to balance plasticity and stability in continual learning, reducing both forgetting and memory cost.

Principles

Gradient interference causes catastrophic forgetting.
Orthogonal projection removes first-order forgetting.
Cosine similarity can gate gradient updates.
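
The three principles above can be connected with a short first-order argument. The notation here is an assumption made for illustration and is not taken from the paper: g is the current-task gradient, g_old is a stored gradient of an earlier task's loss L_old, and η is the learning rate.

```latex
% Effect of the update \theta' = \theta - \eta g on the old-task loss (first-order Taylor expansion)
L_{\text{old}}(\theta - \eta g) \approx L_{\text{old}}(\theta) - \eta \, g_{\text{old}}^{\top} g

% Interference: the old-task loss rises when the two gradients conflict
g_{\text{old}}^{\top} g < 0
\iff
\cos(g, g_{\text{old}}) = \frac{g^{\top} g_{\text{old}}}{\lVert g \rVert \, \lVert g_{\text{old}} \rVert} < 0

% Orthogonal projection removes the component of g along g_old
g_{\perp} = g - \frac{g^{\top} g_{\text{old}}}{\lVert g_{\text{old}} \rVert^{2}} \, g_{\text{old}},
\qquad
g_{\text{old}}^{\top} g_{\perp} = 0
```

With the projected gradient, the first-order term vanishes, so the old-task loss is unchanged to first order. The sign and size of the cosine indicate how strongly an unprojected update would interfere, which is what makes it usable as a gate.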

Method

SFAO maintains a buffer of past gradients and uses Monte Carlo sampling to estimate how well the current gradient aligns with them, measured by cosine similarity. Based on predefined thresholds, it then accepts, projects, or discards the current gradient update, layer by layer.
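
The gating step described above might be sketched as follows, for a single layer's flattened gradient. This is a minimal illustration under stated assumptions, not the authors' implementation: the names gate_gradient, accept_at, and discard_below are hypothetical, the default threshold values are arbitrary, and the choices to gate on the worst-case cosine over the sample and to discard the most strongly conflicting updates are assumptions, since the summary does not specify which cosine range triggers which action.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v, eps=1e-12):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)


def project_out(grad, directions, eps=1e-12):
    """Remove from grad its components in span(directions) via Gram-Schmidt."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip directions already covered by the basis
            basis.append([ri / norm for ri in r])
    out = list(grad)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


def gate_gradient(grad, buffer, n_samples=8,
                  accept_at=0.0, discard_below=-0.5, rng=random):
    """Return (decision, update) for one layer.

    Assumed rule (hypothetical thresholds, not from the paper):
      worst cosine >= accept_at      -> accept the gradient unchanged
      worst cosine <  discard_below  -> discard the update
      otherwise                      -> project out the sampled directions
    """
    if not buffer:
        return "accept", list(grad)
    # Monte Carlo step: compare against a random subset, not the whole buffer.
    sampled = rng.sample(buffer, min(n_samples, len(buffer)))
    worst = min(cosine(grad, g) for g in sampled)
    if worst >= accept_at:
        return "accept", list(grad)
    if worst < discard_below:
        return "discard", [0.0] * len(grad)
    return "project", project_out(grad, sampled)
```

For instance, with a buffer holding the single gradient [1.0, 0.0, 0.0], the mildly conflicting gradient [-0.2, 1.0, 0.0] falls between the two default thresholds and is projected to [0.0, 1.0, 0.0]. Sampling n_samples stored gradients rather than scanning the full buffer is what keeps the per-step cost bounded as the buffer grows.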

In practice

Implement per-layer gating for gradient updates.
Use Monte Carlo approximation for efficiency.
Tune cosine thresholds for stability-plasticity trade-off.
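
One way to approach the threshold tuning listed above is to log per-layer cosine values during training and count how often each decision would fire under candidate settings. The sketch below assumes the same hypothetical three-way rule (accept at or above accept_at, discard below discard_below, project in between). The layer names and cosine values are made-up toy data, not results from the paper.

```python
def decide(cos, accept_at, discard_below):
    # Hypothetical three-way rule; the thresholds are illustrative assumptions.
    if cos >= accept_at:
        return "accept"
    if cos < discard_below:
        return "discard"
    return "project"


def decision_counts(cosines_by_layer, accept_at, discard_below):
    """Count accept/project/discard decisions separately for each layer."""
    counts = {}
    for layer, cos_values in cosines_by_layer.items():
        tally = {"accept": 0, "project": 0, "discard": 0}
        for cos in cos_values:
            tally[decide(cos, accept_at, discard_below)] += 1
        counts[layer] = tally
    return counts


# Toy logged cosines for two hypothetical layers.
logged = {
    "conv1": [0.6, 0.1, -0.2, -0.7],
    "fc": [0.3, -0.1, -0.4, -0.9],
}

permissive = decision_counts(logged, accept_at=-0.3, discard_below=-0.8)
strict = decision_counts(logged, accept_at=0.2, discard_below=-0.3)
print(permissive)
print(strict)
```

Under this assumed rule, raising the thresholds moves more updates from accept into project or discard, which favors stability at the cost of plasticity. Lowering them does the reverse. Keeping the tallies per layer makes it visible when one layer is gated far more aggressively than the others.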

Topics

Selective Forgetting-Aware Optimization
Continual Learning
Catastrophic Forgetting
Gradient Projection
Cosine Similarity Gating

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.