Canonical Regularisation of Wide Feature-Learning Neural Networks
Summary
This research introduces a framework for understanding and regularizing wide neural networks in the feature-learning regime, which is important in practice but less studied than the kernel regime. The authors show that standard and anchored ridge regularization, which are effective in the kernel regime, bias gradient flow in feature-learning networks because the parameter space has curved geometry. This bias persists even as the regularization strength vanishes, and it can degrade the inductive bias of pretrained networks. To address this, the paper axiomatizes a "canonical regularizer" as a regime-agnostic function-space energy. From it the authors derive "geodesic ridge" for feature-learning networks and identify the corresponding function-space prior as a "Riemannian Gibbs Process." For practical use, they propose "arc ridge," a scalable, minimax-robust surrogate for geodesic ridge that is equivalent to early stopping under exact gradient flow. Experiments on an image task (UTKFace with ResNet18) and an NLP transfer-learning task (Yelp Review with DistilBERT) support the theory, showing that arc ridge preserves pretrained priors better than traditional methods at high regularization strengths.
Key takeaway
Research scientists developing or fine-tuning wide neural networks in the feature-learning regime should re-evaluate their regularization strategies. Traditional anchored ridge or weight decay can pathologically degrade pretrained representations by biasing gradient flow along output fibers. Instead, consider "arc ridge" regularization, a scalable, minimax-robust surrogate for the theoretically grounded "geodesic ridge." According to the authors, this approach respects the intrinsic geometry of feature learning, preserves inductive biases, and offers a principled alternative to early stopping, with improved generalization, especially in transfer-learning scenarios.
Key insights
Traditional ridge regularization biases feature-learning networks due to curved parameter-space geometry, necessitating a geodesic-based approach.
Principles
- A network's implicit prior is governed by the geometry of its gradient-flow trajectory.
- Anchored ridge fails in the feature-learning regime because of manifold curvature.
- Geodesic ridge generalizes canonical regularization to curved geometries.
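For concreteness, the two traditional penalties named above can be written in a few lines. This is a minimal sketch using their standard textbook definitions, not code from the paper; the function names and the toy weight vectors are illustrative assumptions.

```python
def standard_ridge(theta, lam):
    """Weight decay: (lam / 2) * ||theta||^2, which pulls parameters toward zero."""
    return 0.5 * lam * sum(t * t for t in theta)


def anchored_ridge(theta, theta0, lam):
    """Anchored ridge: (lam / 2) * ||theta - theta0||^2, which pulls toward the anchor theta0."""
    return 0.5 * lam * sum((t - a) ** 2 for t, a in zip(theta, theta0))


# Hypothetical values, chosen only for illustration.
theta0 = [1.0, -2.0]  # stand-in for pretrained weights
theta = [1.5, -1.0]   # stand-in for weights after some fine-tuning
lam = 0.1

penalty_standard = standard_ridge(theta, lam)
penalty_anchored = anchored_ridge(theta, theta0, lam)
```

Both penalties measure a straight-line (Euclidean) distance in parameter space, either from the origin or from the anchor. The summary's claim is that this straight-line measure is what becomes biased when the underlying geometry is curved.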
Method
The proposed method axiomatizes a canonical function-space energy, lifts it to parameter space, and uses the resulting Riemannian geometry to derive "geodesic ridge" and its scalable surrogate, "arc ridge."
In practice
- Use arc ridge for unbiased regularization in feature-learning networks.
- Arc ridge is computationally efficient, requiring only path length tracking.
- Consider arc ridge as a principled alternative to early stopping.
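The "path length tracking" idea can be sketched as follows. This summary does not give the arc ridge formula, so the code below is not the authors' method: it only accumulates the Euclidean length of the parameter trajectory during plain gradient descent and stops once a length budget would be exceeded, which mirrors the stated link to early stopping. The toy quadratic loss, the hard budget, and all names (`grad`, `train_with_arc_budget`, `arc_budget`) are assumptions made for illustration, and the paper may measure length under a different metric.

```python
import math


def grad(theta):
    # Gradient of a toy loss 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    # Hypothetical stand-in for a network's loss gradient.
    return [theta[0], 10.0 * theta[1]]


def train_with_arc_budget(theta0, lr, max_steps, arc_budget):
    """Gradient descent that stops before the travelled path length exceeds arc_budget."""
    theta = list(theta0)
    arc_length = 0.0
    steps_taken = 0
    for _ in range(max_steps):
        g = grad(theta)
        # Euclidean length of the update that is about to be applied.
        step_len = lr * math.sqrt(sum(x * x for x in g))
        if arc_length + step_len > arc_budget:
            break  # budget exhausted: stop early
        theta = [t - lr * gi for t, gi in zip(theta, g)]
        arc_length += step_len
        steps_taken += 1
    return theta, arc_length, steps_taken


theta0 = [1.0, 1.0]
theta, arc_length, steps_taken = train_with_arc_budget(
    theta0, lr=0.05, max_steps=1000, arc_budget=1.0
)

# Straight-line distance from the start, the quantity anchored ridge penalizes.
chord = math.dist(theta, theta0)
```

The only extra state compared with ordinary training is the scalar `arc_length`, which is why tracking of this kind is cheap. Note that `chord` can never exceed `arc_length`: a trajectory that bends travels farther than the straight line between its endpoints, so the two quantities differ whenever the path is not straight.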
Topics
- Canonical Regularisation
- Feature-Learning Networks
- Geodesic Ridge
- Riemannian Gibbs Process
- Arc Ridge
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.