Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network
Summary
This study investigates gradient descent in the Edge of Stability regime, where large learning rates induce persistent oscillations in both loss and sharpness. It proposes a continuous-time effective model that couples the evolution of the averaged trajectory with the time-averaged covariance of its fast oscillations. The analysis shows that an effective free energy, combining the original risk functional with a curvature-related "entropic" term, is the natural quantity to monitor in such unstable regimes. The model tracks the envelope of the oscillations accurately, even when the oscillation dynamics evolve on a timescale comparable to that of the averaged weights, which makes it possible to follow training spikes. For wide two-layer neural networks, a mean-field limit yields a novel kinetic equation for the joint distribution of weights and their fluctuations, interpretable as a Wasserstein-2 gradient flow of a macroscopic free energy. Numerical experiments on matrix factorization and CIFAR-10 tasks support the model's accuracy and the predictive power of the effective free energy.
Key takeaway
For AI scientists training neural networks with large learning rates, the Edge of Stability regime determines how loss and sharpness behave. This research indicates that the effective free energy, rather than the risk alone, is the quantity to monitor there. Consider adding the effective free energy and the kinetic description to your training analysis tools to diagnose and anticipate gradient descent behavior, especially when loss oscillations persist.
Key insights
Effective free energy, combining risk and curvature, is key to understanding gradient descent dynamics at the Edge of Stability.
Principles
- Large learning rates induce persistent loss and sharpness oscillations in gradient descent.
- An effective free energy, not just risk, is the natural quantity to monitor in unstable regimes.
- The kinetic equation for two-layer networks can be interpreted as a Wasserstein-2 gradient flow.
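The first principle can be illustrated on a toy problem. The sketch below assumes a one-dimensional quadratic loss L(x) = 0.5 * a * x**2, which is an illustrative choice and not a model from the original article. Each gradient descent step multiplies x by (1 - lr * a), so the iterates shrink when the curvature a is below 2 / lr and grow when it is above, flipping sign at every step in both cases shown here. The function name gd_quadratic and all parameter values are hypothetical.

```python
def gd_quadratic(curvature, lr, x0=1.0, steps=50):
    """Run gradient descent on L(x) = 0.5 * curvature * x**2 and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = curvature * xs[-1]       # dL/dx for the toy quadratic
        xs.append(xs[-1] - lr * grad)   # plain gradient descent step
    return xs


lr = 0.1
threshold = 2.0 / lr  # classical stability threshold for a quadratic

# Curvature slightly below the threshold: per-step multiplier is -0.8.
below = gd_quadratic(0.9 * threshold, lr)
# Curvature slightly above the threshold: per-step multiplier is -1.2.
above = gd_quadratic(1.1 * threshold, lr)

print("below threshold, |x| after 50 steps:", abs(below[-1]))
print("above threshold, |x| after 50 steps:", abs(above[-1]))
```

A pure quadratic can only decay or diverge. The persistent, bounded oscillations described above need a non-quadratic loss in which the curvature itself changes along the trajectory, which this toy does not capture.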
Method
A continuous-time effective model tracks the average trajectory and time-averaged covariance of fast oscillations to monitor gradient descent dynamics in unstable regimes.
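To make the idea of an averaged trajectory plus oscillation covariance concrete, here is a minimal sketch under stated assumptions. It uses a scalar parameter, a synthetic period-two oscillation, and a quadratic loss, none of which come from the original article. For a quadratic loss, the loss averaged over the oscillation equals the loss at the mean plus half the curvature times the variance, which shows how a curvature-weighted covariance term can appear next to the risk. This is a second-order Taylor identity used for illustration, not the article's free energy functional, and the names window_stats and loss are hypothetical.

```python
def window_stats(xs):
    """Return the time-averaged mean and variance of a window of iterates."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var


CURVATURE = 3.0  # assumed constant second derivative of the toy loss


def loss(x):
    """Toy quadratic loss with second derivative CURVATURE."""
    return 0.5 * CURVATURE * x * x


# Synthetic fast oscillation of amplitude 0.2 around a slow value of 1.0.
xs = [1.0 + 0.2 * (-1) ** k for k in range(20)]

mean, var = window_stats(xs)
avg_loss = sum(loss(x) for x in xs) / len(xs)

# Loss at the averaged point plus a curvature-weighted variance correction.
estimate = loss(mean) + 0.5 * CURVATURE * var

print("mean:", mean, "variance:", var)
print("time-averaged loss:", avg_loss)
print("loss(mean) + 0.5 * curvature * variance:", estimate)
```

For non-quadratic losses the two quantities agree only up to higher-order terms in the oscillation amplitude.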
In practice
- Track oscillation envelopes and training spikes using the proposed effective model.
- Monitor the effective free energy to predict dynamics in unstable gradient descent.
- Apply the mean-field limit to analyze wide two-layer neural network training.
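When applying the first item, an empirical envelope extracted from a logged loss curve gives a baseline to compare model predictions against. The sketch below is a generic rolling-window heuristic, not the article's effective model. The function names, the window length, the spike factor, and the loss values are all hypothetical.

```python
from statistics import median


def rolling_envelope(values, window):
    """Return (lower, upper) lists: min and max over a trailing window at each step."""
    lower, upper = [], []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        lower.append(min(chunk))
        upper.append(max(chunk))
    return lower, upper


def spike_steps(values, window, factor=3.0):
    """Return steps whose value exceeds factor times the median of the preceding window."""
    spikes = []
    for i in range(1, len(values)):
        previous = values[max(0, i - window): i]
        if values[i] > factor * median(previous):
            spikes.append(i)
    return spikes


# Made-up oscillating loss history with one spike at step 6.
losses = [1.0, 0.6, 0.9, 0.5, 0.8, 0.45, 4.0, 0.7, 0.4]

lower, upper = rolling_envelope(losses, window=4)
spikes = spike_steps(losses, window=4)

print("lower envelope:", lower)
print("upper envelope:", upper)
print("spike steps:", spikes)
```

The window should span at least one full oscillation period; for the period-two oscillations typical of this regime, a handful of steps is usually enough.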
Topics
- Gradient Descent
- Edge of Stability
- Neural Network Training
- Free Energy Models
- Kinetic Equations
- Wasserstein Gradient Flow
- CIFAR-10
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.