Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

2026-06-03 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

This study investigates the dynamics of gradient descent in the Edge of Stability regime, where large learning rates induce persistent oscillations in both the loss and the sharpness. It proposes a continuous-time effective model that couples the evolution of the averaged trajectory with the time-averaged covariance of its fast oscillations. The analysis shows that an effective free energy, which combines the original risk functional with a curvature-related "entropic" term, is the natural quantity to monitor in such unstable regimes. The model accurately tracks the envelope of the oscillations, even when that envelope evolves on a timescale comparable to that of the averaged weights, which makes it possible to follow training spikes. For wide two-layer neural networks, a mean-field limit yields a novel kinetic equation for the joint distribution of weights and their fluctuations, which can be interpreted as a Wasserstein-2 gradient flow of a macroscopic free energy. Numerical experiments on matrix factorization and CIFAR-10 tasks support the model's accuracy and the predictive power of the effective free energy.

Key takeaway

For AI scientists training neural networks with large learning rates, the Edge of Stability regime is where plain risk curves stop telling the whole story. The effective free energy and the kinetic description proposed here offer a framework for diagnosing and predicting gradient descent behavior in that regime. Adding these quantities to training analysis tools can give more insight into stability and convergence, especially when loss oscillations persist.

Key insights

Effective free energy, combining risk and curvature, is key to understanding gradient descent dynamics at the Edge of Stability.

Principles

Large learning rates induce persistent loss and sharpness oscillations in gradient descent.
An effective free energy, not just risk, is the natural quantity to monitor in unstable regimes.
The kinetic equation for two-layer networks can be interpreted as a Wasserstein-2 gradient flow.
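The first principle can be illustrated on the simplest possible case. The sketch below is not taken from the paper: it runs gradient descent on a one-dimensional quadratic with curvature lam, where the classical stability threshold for the learning rate is 2 / lam. The function name gd_quadratic and all parameter values are hypothetical. A quadratic only shows the threshold itself; the Edge of Stability regime concerns non-quadratic losses where oscillations persist instead of diverging.

```python
from typing import List


def gd_quadratic(lam: float, eta: float, x0: float = 1.0, steps: int = 50) -> List[float]:
    """Gradient descent on f(x) = 0.5 * lam * x**2 (illustrative toy, not the paper's setup)."""
    x = x0
    traj = [x]
    for _ in range(steps):
        grad = lam * x          # f'(x) = lam * x
        x = x - eta * grad      # each step multiplies x by (1 - eta * lam)
        traj.append(x)
    return traj


lam = 10.0                      # curvature ("sharpness") of the toy loss
threshold = 2.0 / lam           # classical stability limit for the learning rate

stable = gd_quadratic(lam, 0.9 * threshold)    # just below the threshold
unstable = gd_quadratic(lam, 1.1 * threshold)  # just above the threshold

print("below threshold, final |x|:", abs(stable[-1]))
print("above threshold, final |x|:", abs(unstable[-1]))
```

In both runs the iterate flips sign at every step, because the multiplier 1 - eta * lam is negative. Below the threshold the oscillation decays; above it the oscillation grows.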

Method

A continuous-time effective model tracks the average trajectory and time-averaged covariance of fast oscillations to monitor gradient descent dynamics in unstable regimes.
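A second-order Taylor expansion gives a heuristic for why a curvature term can sit next to the risk. This is a sketch under assumptions and not necessarily the paper's definition of the effective free energy. Write the iterates as an averaged trajectory plus a fast fluctuation with zero time average, using the assumed notation \bar\theta, \delta_t and \Sigma:

```latex
\begin{align*}
\theta_t &= \bar\theta + \delta_t, \qquad \langle \delta_t \rangle = 0, \qquad \Sigma = \langle \delta_t \delta_t^\top \rangle, \\
\langle L(\theta_t) \rangle &\approx L(\bar\theta) + \tfrac{1}{2}\,\mathrm{Tr}\!\big(\nabla^2 L(\bar\theta)\,\Sigma\big).
\end{align*}
```

The linear term drops out because the fluctuation averages to zero, so the time-averaged loss depends on the averaged weights and on the curvature weighted by the oscillation covariance. These are the two objects the effective model is described as tracking.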

In practice

Track oscillation envelopes and training spikes using the proposed effective model.
Monitor the effective free energy to predict dynamics in unstable gradient descent.
Apply the mean-field limit to analyze wide two-layer neural network training.
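As a starting point for the first two items, one could estimate the averaged trajectory and the oscillation covariance empirically from logged iterates. The sketch below is a minimal illustration, not the paper's method: the paper evolves these quantities through continuous-time equations, whereas this code just computes trailing-window statistics. The function name window_stats, the window length, and the toy trajectory are all hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def window_stats(traj: List[Vector], window: int) -> List[Tuple[Vector, Matrix]]:
    """Trailing-window mean and (population) covariance of a parameter trajectory."""
    if window < 2 or window > len(traj):
        raise ValueError("window must satisfy 2 <= window <= len(traj)")
    dim = len(traj[0])
    out = []
    for end in range(window, len(traj) + 1):
        chunk = traj[end - window:end]
        # Averaged weights over the window.
        mean = [sum(p[i] for p in chunk) / window for i in range(dim)]
        # Covariance of the fluctuations around that average.
        cov = [
            [
                sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in chunk) / window
                for j in range(dim)
            ]
            for i in range(dim)
        ]
        out.append((mean, cov))
    return out


# Toy trajectory: coordinate 0 flips sign every step (a period-2 oscillation),
# coordinate 1 drifts slowly.
traj = [[(-1.0) ** t, 0.01 * t] for t in range(20)]
stats = window_stats(traj, window=4)
mean0, cov0 = stats[0]
print("mean of first window:", mean0)
print("variance along oscillating coordinate:", cov0[0][0])
```

An even window length is a deliberate choice here: it averages out a period-2 oscillation exactly, so the windowed mean reflects the slow drift and the covariance reflects the oscillation amplitude.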

Topics

Gradient Descent
Edge of Stability
Neural Network Training
Free Energy Models
Kinetic Equations
Wasserstein Gradient Flow
CIFAR-10

Best for: Research Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.