Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

A new continuous-time effective model is proposed for analyzing gradient descent in the Edge of Stability (EoS) regime, where large learning rates induce persistent oscillations in the loss and the sharpness. The model tracks the average trajectory (θ) coupled with the time-averaged covariance (Σ) of the fast oscillations around it. The analysis introduces an "effective free energy" F(θ,Σ), which combines the original risk functional with a curvature-related "entropic" term, as the natural quantity to monitor. The model accurately captures oscillation envelopes and explains sharpness increases, even when the fluctuations evolve on timescales similar to those of the averaged weights. For wide two-layer neural networks, a mean-field limit yields a novel kinetic equation for the joint distribution of weights and fluctuations. Numerical experiments on matrix factorization (d=2, L=3, η=0.077) and CIFAR-10 (n=500 images, 2-layer CNN/MLP, η=0.02) support the model's accuracy and the predictive power of the effective free energy.
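
The sharpness mentioned above is the largest eigenvalue of the loss Hessian. As background that is not stated in this summary, the Edge of Stability literature typically reports it hovering near 2/η. The sketch below is one possible way to monitor it, assuming only a gradient function is available. The names `sharpness` and `stability_threshold`, the power-iteration estimator, and the toy quadratic are illustrative assumptions, not the paper's code.

```python
import numpy as np

def stability_threshold(eta):
    """Classical stability threshold 2/eta for gradient descent on a quadratic."""
    return 2.0 / eta

def sharpness(grad_fn, theta, iters=100, eps=1e-4, seed=0):
    """Estimate the dominant Hessian eigenvalue at theta by power iteration.

    Hessian-vector products are approximated by a central finite difference
    of the gradient, so no explicit Hessian is formed. Power iteration finds
    the eigenvalue of largest magnitude, which matches the sharpness when
    the top eigenvalue dominates any negative curvature.
    """
    theta = np.asarray(theta, dtype=float)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        # Finite-difference Hessian-vector product H v.
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        lam = float(v @ hv)  # Rayleigh quotient for the current direction
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam

# Hypothetical toy problem: quadratic loss with Hessian diag(1, 3).
H = np.diag([1.0, 3.0])
toy_sharpness = sharpness(lambda t: H @ t, np.zeros(2))
print("sharpness estimate:", toy_sharpness)  # close to 3.0 for this toy Hessian

# Thresholds for the two learning rates quoted in the summary.
for eta in (0.077, 0.02):
    print(f"eta={eta}: 2/eta = {stability_threshold(eta):.3f}")
```

For a real network, `grad_fn` would be replaced by the training-loss gradient evaluated on the flattened parameter vector, and the estimate would be logged alongside 2/η during training.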

Key takeaway

Machine learning engineers and research scientists optimizing deep networks in the Edge of Stability regime should consider monitoring the proposed effective free energy F(θ,Σ) rather than the loss E(θ) alone. The continuous-time model gives a more accurate picture of the optimization dynamics, particularly for tracking oscillation envelopes and explaining sharpness increases. Integrating the coupled ODEs for θ and Σ can help predict training behavior and may guide hyperparameter tuning toward solutions with better generalization properties.

Key insights

A new model tracks the average trajectory and the oscillation covariance, revealing an effective free energy for Edge of Stability dynamics.

Principles

Method

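The model uses the ansatz θ̃_k = θ_k + √η δθ_k, which splits each gradient descent iterate θ̃_k into a slowly varying average θ_k and a fast fluctuation δθ_k scaled by √η, to derive coupled continuous-time ODEs for the average trajectory (θ) and the covariance of the oscillations (Σ).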
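
To make the decomposition concrete, the following sketch estimates θ and Σ from a recorded sequence of raw iterates. It assumes a sliding-window average as the time average. The window length, the function name `split_mean_and_fluctuations`, and the synthetic period-2 data are hypothetical choices for illustration; the source does not specify how the averages are computed in practice.

```python
import numpy as np

def split_mean_and_fluctuations(iterates, eta, window):
    """Estimate the slow trajectory and the fluctuation covariance.

    iterates: array of shape (K, p) holding the raw iterates (theta-tilde).
    eta:      learning rate, used to rescale fluctuations as in the ansatz.
    window:   number of consecutive steps per average (assumed, not from
              the source); an even value cancels a period-2 oscillation.
    Returns theta_bar of shape (K - window + 1, p) and sigma of shape
    (K - window + 1, p, p).
    """
    iterates = np.asarray(iterates, dtype=float)
    K, p = iterates.shape
    n = K - window + 1
    theta_bar = np.empty((n, p))
    sigma = np.empty((n, p, p))
    for i in range(n):
        chunk = iterates[i:i + window]
        theta_bar[i] = chunk.mean(axis=0)
        # Invert the ansatz: delta = (theta_tilde - theta) / sqrt(eta).
        delta = (chunk - theta_bar[i]) / np.sqrt(eta)
        # Time-averaged outer product of the fluctuations.
        sigma[i] = delta.T @ delta / window
    return theta_bar, sigma

# Synthetic check: a fixed mean plus a period-2 oscillation along v.
eta = 0.04
theta_true = np.array([1.0, -2.0])
v = np.array([0.5, 0.0])
signs = (-1.0) ** np.arange(20)
iterates = theta_true + np.sqrt(eta) * signs[:, None] * v

theta_bar, sigma = split_mean_and_fluctuations(iterates, eta, window=4)
print(theta_bar[0])  # recovers theta_true up to rounding
print(sigma[0])      # recovers the outer product of v with itself
```

On real training runs the oscillation is not exactly periodic, so the window trades off smoothing of the fast motion against lag in the slow trajectory.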

In practice

Topics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.