Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

This study investigates the dynamics of gradient descent in the Edge of Stability regime, where large learning rates induce persistent oscillations in both loss and sharpness. It proposes a continuous-time effective model that tracks the evolution of the average trajectory coupled with the time-averaged covariance of its fast oscillations. The analysis shows that an effective free energy, which combines the original risk functional with a curvature-related "entropic" term, is the natural quantity to monitor in such unstable regimes. The model accurately tracks the envelope of the oscillations, even when that envelope evolves on a timescale similar to that of the averaged weights, which makes it possible to follow training spikes. For wide two-layer neural networks, a mean-field limit yields a novel kinetic equation describing the joint distribution of weights and their fluctuations, interpretable as a Wasserstein-2 gradient flow of a macroscopic free energy. Numerical experiments on matrix factorization and CIFAR-10 tasks support the model's accuracy and the predictive power of the effective free energy.

Key takeaway

For AI scientists training neural networks with large learning rates, the Edge of Stability regime matters because loss and sharpness keep oscillating instead of settling. The effective free energy and the kinetic description summarized here offer a way to diagnose and anticipate that behavior. Monitoring the free energy alongside the raw loss in your training analysis tools may give a clearer picture of stability and convergence when oscillations persist.

Key insights

An effective free energy, the sum of the original risk and a curvature-related "entropic" term, is the natural quantity to monitor for gradient descent at the Edge of Stability, where the raw loss itself oscillates.
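
To see why a curvature term can appear next to the risk, consider a generic second-order expansion. This is a schematic illustration under stated assumptions, not the paper's definition of the free energy: the symbols \bar\theta (averaged weights), \xi (fast oscillation with zero mean) and \Sigma (its time-averaged covariance) are notation introduced here, and the paper's actual entropic term may differ.

```latex
% Assumption: \theta = \bar\theta + \xi with \langle \xi \rangle = 0 and
% \langle \xi \xi^\top \rangle = \Sigma, where \langle \cdot \rangle is a time average.
\begin{align}
\langle L(\bar\theta + \xi) \rangle
  &= L(\bar\theta)
   + \nabla L(\bar\theta)^\top \langle \xi \rangle
   + \tfrac{1}{2} \langle \xi^\top \nabla^2 L(\bar\theta)\, \xi \rangle
   + O(\|\xi\|^3) \\
  &= L(\bar\theta)
   + \tfrac{1}{2} \operatorname{Tr}\!\big( \nabla^2 L(\bar\theta)\, \Sigma \big)
   + O(\|\xi\|^3).
\end{align}
```

The first-order term vanishes because the oscillation has zero mean, so the averaged loss exceeds the loss at the averaged weights by a curvature-weighted measure of the oscillation size.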

Principles

Method

A continuous-time effective model tracks the average trajectory and time-averaged covariance of fast oscillations to monitor gradient descent dynamics in unstable regimes.
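
The two tracked quantities can be illustrated on a toy problem. The sketch below is not the paper's effective model: it runs plain gradient descent on a scalar factorization loss 0.5 * (u*v - 1)^2 and estimates the average trajectory and the oscillation covariance with a sliding window. The loss, the starting point, the learning rate, the window length and all function names are assumptions chosen for illustration.

```python
import numpy as np

def loss(w):
    """Toy rank-one factorization loss 0.5 * (u*v - 1)^2 (illustrative assumption)."""
    u, v = w
    return 0.5 * (u * v - 1.0) ** 2

def grad(w):
    """Analytic gradient of the toy loss."""
    u, v = w
    r = u * v - 1.0
    return np.array([r * v, r * u])

def sharpness(w):
    """Largest eigenvalue of the analytic Hessian of the toy loss."""
    u, v = w
    off = 2.0 * u * v - 1.0
    hess = np.array([[v * v, off], [off, u * u]])
    return np.linalg.eigvalsh(hess)[-1]

def run_gd(w0, lr, steps):
    """Plain gradient descent; returns the full trajectory, shape (steps + 1, 2)."""
    traj = np.empty((steps + 1, 2))
    traj[0] = w0
    for t in range(steps):
        traj[t + 1] = traj[t] - lr * grad(traj[t])
    return traj

def window_stats(traj, window):
    """Sliding-window mean and covariance of the iterates.

    The mean is a crude stand-in for the average trajectory and the
    covariance for the time-averaged covariance of the fast oscillations.
    """
    means, covs = [], []
    for end in range(window, len(traj) + 1):
        chunk = traj[end - window:end]
        means.append(chunk.mean(axis=0))
        covs.append(np.cov(chunk, rowvar=False, bias=True))
    return np.array(means), np.array(covs)

if __name__ == "__main__":
    lr = 0.5  # hypothetical learning rate; stability threshold is 2 / lr
    traj = run_gd(np.array([2.0, 0.3]), lr, steps=50)
    means, covs = window_stats(traj, window=4)
    print("stability threshold 2/lr:", 2.0 / lr)
    print("sharpness at first / last averaged point:",
          sharpness(means[0]), sharpness(means[-1]))
    print("oscillation size tr(cov), first / last window:",
          np.trace(covs[0]), np.trace(covs[-1]))
```

The starting point is chosen so that the initial sharpness sits roughly at the threshold 2/lr, which makes the iterates oscillate before they settle. Comparing the sharpness of the windowed mean with 2/lr, and watching the trace of the windowed covariance, gives a rough discrete analogue of following the averaged weights and the oscillation envelope together.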

Best for: Research Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.