Canonical Regularisation of Wide Feature-Learning Neural Networks

Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper introduces a framework for understanding and regularizing wide neural networks in the feature-learning regime, which has been studied less than the kernel regime. The authors show that standard and anchored ridge regularization, both effective in the kernel regime, bias gradient flow in feature-learning networks because their parameter space has curved geometry. The bias persists even as the regularization strength vanishes, and it can degrade the inductive bias of pretrained networks. To address this, the paper axiomatizes a "canonical regularizer" as a regime-agnostic function-space energy. From it the authors derive "geodesic ridge" for feature-learning networks and identify the corresponding function-space prior as a "Riemannian Gibbs Process." For practical use they propose "arc ridge," a scalable, minimax-robust surrogate for geodesic ridge that is equivalent to early stopping under exact gradient flow. Experiments on an image-processing problem (UTKFace with ResNet18) and an NLP transfer-learning problem (Yelp Review with DistilBERT) support the theory: arc ridge preserves pretrained priors better than traditional methods at high regularization strengths.
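For reference, the two traditional penalties named above can be sketched as follows. This is a minimal illustration of standard ridge (weight decay) and anchored ridge in their usual textbook forms, not the paper's geodesic ridge or arc ridge, whose definitions are not given in this summary. The function names, the example weights, and the choice of scaling (lam rather than lam/2) are assumptions made for illustration.

```python
from typing import Sequence


def standard_ridge(theta: Sequence[float], lam: float) -> float:
    """Standard ridge / weight decay: lam * ||theta||^2 (pulls weights toward zero)."""
    return lam * sum(t * t for t in theta)


def anchored_ridge(theta: Sequence[float], theta0: Sequence[float], lam: float) -> float:
    """Anchored ridge: lam * ||theta - theta0||^2 (pulls weights toward the anchor theta0)."""
    if len(theta) != len(theta0):
        raise ValueError("theta and theta0 must have the same length")
    return lam * sum((t - a) ** 2 for t, a in zip(theta, theta0))


# Hypothetical pretrained weights, used here as the anchor.
theta0 = [0.5, -1.0, 2.0]
lam = 0.1

# At the pretrained point the anchored penalty vanishes,
# while the standard penalty is still positive.
penalty_anchored_at_init = anchored_ridge(theta0, theta0, lam)
penalty_standard_at_init = standard_ridge(theta0, lam)
```

Both penalties are Euclidean distances in parameter space. According to the summary, this is the property that causes the bias once the parameter-space geometry is curved.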

Key takeaway

Research scientists developing or fine-tuning wide neural networks in the feature-learning regime should re-evaluate their regularization strategies. Traditional anchored ridge or weight decay can pathologically degrade pretrained representations by biasing gradient flow along output fibers. Consider "arc ridge" regularization instead, a scalable, minimax-robust surrogate for the theoretically grounded "geodesic ridge." The approach respects the intrinsic geometry of feature learning, preserves inductive biases, and offers a principled alternative to early stopping, which can improve generalization, especially in transfer learning.

Key insights

Traditional ridge regularization biases feature-learning networks due to curved parameter-space geometry, necessitating a geodesic-based approach.

Method

The proposed method axiomatizes a canonical function-space energy, lifts it to parameter space, and uses Riemannian geometry to derive "geodesic ridge" and its scalable surrogate "arc ridge."

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.