Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network
Summary
A new continuous-time effective model is proposed to analyze gradient descent dynamics in the Edge of Stability (EoS) regime, where large learning rates induce persistent oscillations in the loss and the sharpness. The model tracks the evolution of the average trajectory (θ) coupled with the time-averaged covariance (Σ) of its fast oscillations. The analysis introduces an "effective free energy" F(θ,Σ), which combines the original risk functional with a curvature-related "entropic" term, as the natural quantity to monitor. The model accurately captures oscillation envelopes and explains sharpness increases, even when the oscillation dynamics evolve on timescales similar to those of the averaged weights. For wide two-layer neural networks, a mean-field limit yields a novel kinetic equation describing the joint distribution of weights and fluctuations. Numerical experiments on matrix factorization (d=2, L=3, η=0.077) and CIFAR-10 (n=500 images, 2-layer CNN/MLP, η=0.02) support the model's accuracy and the predictive power of the effective free energy.
Key takeaway
Machine learning engineers and research scientists optimizing deep networks in the Edge of Stability regime should consider monitoring the proposed effective free energy F(θ,Σ) rather than the loss E(θ) alone. The continuous-time model offers a more accurate description of optimization dynamics, particularly for tracking oscillation envelopes and explaining sharpness increases. Implementing the coupled ODEs for θ and Σ can improve predictions of training behavior and may help guide hyperparameter tuning towards solutions with better generalization properties.
Key insights
A new model tracks average trajectory and oscillation covariance, revealing an effective free energy for Edge of Stability dynamics.
Principles
- Effective free energy F(θ,Σ) is the natural quantity to monitor in unstable optimization regimes.
- Edge of Stability's implicit bias drives optimizers towards flatter minima, improving generalization.
- The Hessian's top eigenvectors exhibit local stability, which informs the geometry of the loss landscape at low energies.
Method
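The model uses the ansatz θ̃_k = θ_k + √η δθ_k, which splits each gradient descent iterate θ̃_k into an average part θ_k and a fast oscillation δθ_k scaled by the square root of the learning rate η. From this ansatz it derives coupled continuous-time ODEs for the average trajectory (θ) and the covariance of the oscillations (Σ).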
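To make the ansatz concrete, the sketch below checks a standard second-order Taylor identity: if the oscillations δθ have zero mean and covariance Σ, the loss averaged over the perturbed iterates equals E(θ) + (η/2) tr(∇²E(θ) Σ), exactly so for a quadratic loss. This shows how a curvature-weighted term in Σ can arise; it is not claimed to be the paper's exact definition of F(θ,Σ). The quadratic loss, the matrix `A`, and all numeric values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic loss E(theta) = 0.5 * theta^T A theta (not from the paper).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def loss(theta):
    return 0.5 * theta @ A @ theta

eta = 0.1                       # learning rate (illustrative value)
theta = np.array([1.0, -2.0])   # a point on the average trajectory

# Sample oscillations and center them so their empirical mean is exactly zero.
delta = rng.normal(size=(1000, 2))
delta -= delta.mean(axis=0)

# Ansatz: perturbed iterates theta_tilde = theta + sqrt(eta) * delta_theta.
theta_tilde = theta + np.sqrt(eta) * delta

# Empirical covariance of the oscillations (1/N normalisation).
Sigma = delta.T @ delta / len(delta)

# Loss averaged over the perturbed iterates.
avg_loss = np.mean([loss(t) for t in theta_tilde])

# Second-order expansion; the Hessian of this quadratic loss is A.
expansion = loss(theta) + 0.5 * eta * np.trace(A @ Sigma)

print(avg_loss, expansion)
```

For a non-quadratic loss the two quantities agree only up to higher-order terms in η, which is one reason the averaged dynamics are tracked through both θ and Σ rather than θ alone.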
In practice
- Monitor F(θ,Σ) instead of E(θ) for better insight into EoS optimization.
- Initialize Σ by running relaxation steps, then sampling centered gradient steps.
- For high-dimensional problems, track top Hessian eigenvectors instead of the full Σ matrix.
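The last two practices can be sketched as follows, under stated assumptions. The function names `init_oscillations` and `top_eigenpair`, the step counts, and the quadratic test loss are hypothetical and are not taken from the paper. Relaxation is interpreted here as plain gradient descent steps, centering as subtracting the sample mean, and the top Hessian eigenvector is found by power iteration on finite-difference Hessian-vector products, so that only the variance of the oscillations along that direction needs to be stored instead of the full Σ.

```python
import numpy as np

# Hypothetical quadratic loss with Hessian A; its top eigenvalue (sharpness) is 5.
A = np.diag([5.0, 1.0, 0.5])

def grad(theta):
    return A @ theta

def hvp(theta, v, eps=1e-3):
    # Hessian-vector product via central differences of the gradient.
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

def top_eigenpair(theta, iters=200, seed=0):
    # Power iteration; converges when the eigenvalue of largest magnitude
    # is positive and separated from the others.
    v = np.random.default_rng(seed).normal(size=theta.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(theta, v)
        v = w / np.linalg.norm(w)
    return v @ hvp(theta, v), v

def init_oscillations(theta0, eta, relax_steps=50, sample_steps=20):
    # 1) Relaxation: gradient descent steps to leave the initial transient.
    theta = theta0.copy()
    for _ in range(relax_steps):
        theta = theta - eta * grad(theta)
    # 2) Sampling: collect further iterates, then center them.
    samples = []
    for _ in range(sample_steps):
        theta = theta - eta * grad(theta)
        samples.append(theta.copy())
    samples = np.array(samples)
    mean = samples.mean(axis=0)
    delta = (samples - mean) / np.sqrt(eta)  # undo the sqrt(eta) scaling of the ansatz
    return mean, delta

eta = 0.39  # illustrative: just below 2 / 5, so the top direction oscillates
theta_bar, delta = init_oscillations(np.ones(3), eta)
lam, v = top_eigenpair(theta_bar)

sigma_top = np.mean((delta @ v) ** 2)               # v^T Sigma v
sigma_total = np.mean(np.sum(delta ** 2, axis=1))   # tr(Sigma)
print(lam, sigma_top / sigma_total)
```

In this toy setting nearly all of the oscillation variance lies along the top eigenvector, which is the situation in which a low-rank representation of Σ is a reasonable substitute for the full matrix. For a real network, `grad` would be replaced by the training-loss gradient, and an automatic-differentiation Hessian-vector product would likely be preferable to finite differences.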
Topics
- Gradient Descent
- Edge of Stability
- Neural Network Optimization
- Effective Free Energy
- Kinetic Equations
- Mean-Field Theory
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.