Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks

2026-04-09 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This research investigates the dynamics of Stochastic Gradient Descent (SGD) in Deep Linear Networks (DLNs) within the saddle-to-saddle training regime, where features are learned sequentially. The study models SGD as stochastic Langevin dynamics with anisotropic, state-dependent noise, extending previous analyses of gradient flow. Under assumptions of aligned and balanced weights, the authors derive an exact decomposition of the dynamics into one-dimensional per-mode stochastic differential equations. Key findings include that maximal diffusion along a mode precedes its complete learning, indicating that SGD noise encodes information about feature learning progression. The stationary distribution of SGD for each mode is also characterized: it matches gradient flow's Dirac mass in the absence of label noise, but approximates a Boltzmann distribution when label noise is present. Experimental results qualitatively confirm these theoretical predictions even when strict alignment and balance assumptions are relaxed, demonstrating that SGD noise informs feature learning without fundamentally altering the saddle-to-saddle dynamics.

Key takeaway

For research scientists studying deep learning optimization: in DLNs, SGD noise carries a predictive signal about how far feature learning has progressed. State-dependent, anisotropic noise models reflect SGD's behavior more accurately than isotropic ones, so they are the better starting point for theoretical analysis. On this view, stochasticity is not merely a computational convenience but an informative signal, which may refine how you think about implicit bias and generalization in deep neural networks.

Key insights

SGD noise provides information about feature learning progression in deep linear networks without altering saddle-to-saddle dynamics.

Principles

SGD noise is state-dependent and anisotropic.
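Maximal diffusion along a mode precedes the complete learning of that mode.
Without label noise, SGD's per-mode stationary distribution matches gradient flow's Dirac mass; with label noise, it approximates a Boltzmann distribution.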
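
As a reading aid, the principles above can be written against the standard continuous-time approximation of SGD. This is a generic sketch, not the paper's exact per-mode equations; the symbols (learning rate \eta, batch size B, per-sample loss \ell_i, potential V, temperature T) are assumed notation introduced here.

```latex
% Generic SDE approximation of minibatch SGD (assumed notation).
\[
  d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\eta}\,\Sigma(\theta_t)^{1/2}\,dW_t,
  \qquad
  \Sigma(\theta) \approx \frac{1}{B}\,\mathrm{Cov}_i\!\left[\nabla \ell_i(\theta)\right].
\]
% State-dependent: \Sigma varies with \theta.
% Anisotropic: \Sigma(\theta) is not a multiple of the identity.

% One-dimensional reference case with constant diffusion:
\[
  dx_t = -V'(x_t)\,dt + \sqrt{2T}\,dW_t
  \quad\Longrightarrow\quad
  p_\infty(x) \propto \exp\!\left(-\frac{V(x)}{T}\right).
\]
```

The second display is the textbook Boltzmann form for constant diffusion. If instead the diffusion coefficient vanishes at the minimizer of V, as can happen when the data are fit exactly, the noise switches off at the solution and the stationary law can collapse to a point mass there.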

Method

The study models SGD as anisotropic Langevin dynamics, decomposing it into one-dimensional per-mode stochastic differential equations under balanced and aligned weight assumptions to analyze feature learning.
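
To make "state-dependent noise" concrete, here is a minimal sketch on a toy single-mode problem: a two-layer scalar linear network with balanced weights a = b fitting y = s·x. It is an illustration under stated assumptions, not the authors' derivation; the function names, the Gaussian inputs, and the target strength `s` are all hypothetical choices made here.

```python
import random


def make_data(n, s, label_noise_std, seed=0):
    """Toy regression data: y = s * x + Gaussian label noise (assumed setup)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [s * x + rng.gauss(0.0, label_noise_std) for x in xs]
    return xs, ys


def per_sample_grad_a(a, b, x, y):
    """Gradient w.r.t. a of the per-sample loss 0.5 * (a*b*x - y)**2."""
    return (a * b * x - y) * b * x


def grad_variance(a, b, xs, ys):
    """Variance of per-sample gradients at (a, b): a proxy for SGD noise strength."""
    grads = [per_sample_grad_a(a, b, x, y) for x, y in zip(xs, ys)]
    mean = sum(grads) / len(grads)
    return sum((g - mean) ** 2 for g in grads) / len(grads)


def variance_profile(s, label_noise_std, n=2000, steps=11):
    """Gradient variance along the balanced path a = b, as mode strength w = a*a goes from 0 to s."""
    xs, ys = make_data(n, s, label_noise_std)
    profile = []
    for k in range(steps):
        a = (k / (steps - 1)) * (s ** 0.5)
        profile.append((a * a, grad_variance(a, a, xs, ys)))
    return profile


if __name__ == "__main__":
    for noise in (0.0, 0.5):
        print(f"label_noise_std={noise}")
        for w, var in variance_profile(2.0, noise):
            print(f"  w={w:.2f}  grad_var={var:.4f}")
```

In this toy, with no label noise the gradient variance is proportional to w(w - s)^2: it vanishes at the saddle (w = 0) and at the solution (w = s), and peaks in between at w = s/3, before the mode is fully learned. With label noise, the variance stays positive at w = s.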

In practice

Monitor modewise diffusion to predict feature learning stages.
Account for label noise when reasoning about SGD's stationary distribution.
Use state-dependent noise models for accurate SGD simulation.
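
One way to experiment with the last two points is a generic Euler-Maruyama integrator for a one-dimensional SDE whose diffusion coefficient may depend on the state. The sketch below is an assumption-laden illustration: the quadratic drift, the two diffusion choices, and all names are hypothetical stand-ins, not the paper's per-mode SDE.

```python
import math
import random


def euler_maruyama(drift, diffusion, x0, dt, n_steps, seed=0):
    """Integrate dx = drift(x) dt + diffusion(x) dW with the Euler-Maruyama scheme."""
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + drift(x) * dt + diffusion(x) * math.sqrt(dt) * noise
        path.append(x)
    return path


def tail_variance(path, burn_in):
    """Sample variance of the path after discarding the first burn_in points."""
    tail = path[burn_in:]
    mean = sum(tail) / len(tail)
    return sum((x - mean) ** 2 for x in tail) / len(tail)


if __name__ == "__main__":
    target = 2.0  # hypothetical minimizer of a quadratic potential
    temperature = 0.1  # hypothetical constant noise level

    def pull(x):
        # Drift of a quadratic potential V(x) = 0.5 * (x - target)**2.
        return -(x - target)

    # Diffusion that vanishes at the minimizer (stand-in for "no label noise").
    clean = euler_maruyama(pull, lambda x: 0.5 * abs(x - target), 0.1, 0.01, 20000)
    print("final state, vanishing noise:", clean[-1])

    # Constant diffusion (stand-in for "label noise present").
    noisy = euler_maruyama(pull, lambda x: math.sqrt(2 * temperature), target, 0.01, 200000)
    print("stationary variance, constant noise:", tail_variance(noisy, 100000))
```

With diffusion that vanishes at the minimizer, the trajectory settles onto the minimizer, mirroring a point-mass stationary law. With constant diffusion, the tail variance should sit near temperature / curvature (0.1 here), as expected for a Boltzmann distribution over a quadratic potential.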

Topics

Stochastic Gradient Descent
Deep Linear Networks
Saddle-to-Saddle Dynamics
Stochastic Differential Equations
Feature Learning

Best for: Research Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.