Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits

Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper introduces a Fisher-geometric theory of Stochastic Gradient Descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix rather than an exogenous scalar. Under exchangeable sampling, the mini-batch gradient covariance is primarily determined by the projected covariance of per-sample gradients. That covariance equals the projected Fisher information for well-specified likelihood losses and the projected Godambe matrix for general M-estimation losses. This identification leads to a diffusion approximation with Fisher/Godambe-structured volatility, in which the effective temperature is $\tau=\eta/b$ (learning rate $\eta$ divided by batch size $b$). Linearizing the diffusion gives an Ornstein-Uhlenbeck process whose stationary covariance is available in closed form as the solution of a Fisher-Lyapunov equation. The authors prove matching minimax upper and lower bounds of order $\Theta(1/N)$ for Fisher/Godambe risk under a total oracle budget $N$. They also derive oracle-complexity guarantees for $\varepsilon$-stationarity in the Fisher dual norm; these depend on an intrinsic effective dimension and a Fisher/Godambe condition number rather than on ambient dimension or Euclidean conditioning. Experiments validate the Lyapunov predictions and show that scalar temperature matching fails to reproduce the directional structure of the noise.
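
As a reading aid, the objects named in the summary can be written in the form such diffusion approximations usually take. The notation here is an assumption and is not taken from the paper: $L$ is the population loss, $G^{\star}(\theta)$ the projected per-sample gradient covariance, $H$ the Hessian of $L$ at a minimizer $\theta^{\star}$, and $W_t$ a standard Brownian motion. The paper's exact normalization may differ.

$$
d\theta_t = -\nabla L(\theta_t)\,dt + \sqrt{\tau}\,G^{\star}(\theta_t)^{1/2}\,dW_t, \qquad \tau = \eta/b
$$

$$
H\Sigma + \Sigma H^{\top} = \tau\,G^{\star}(\theta^{\star})
$$

The first line is the diffusion with matrix-valued volatility. The second is the Lyapunov equation satisfied by the stationary covariance $\Sigma$ of the Ornstein-Uhlenbeck linearization around $\theta^{\star}$. Under this assumed form, a well-specified likelihood with $H = G^{\star}$ gives $\Sigma = (\tau/2)\,I$ on the subspace where $G^{\star}$ is invertible.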

Key takeaway

For AI researchers and research scientists who optimize models with SGD, the central point is that mini-batch noise has an intrinsic Fisher/Godambe geometry. Your batch size, through the effective temperature $\tau=\eta/b$, sets the scale of a diffusion process whose anisotropic noise shape is fixed by the loss, and this influences convergence in statistically meaningful directions. Consider optimizing for Fisher/Godambe risk, whose guarantees scale with an intrinsic effective dimension and a Fisher/Godambe condition number, rather than relying solely on Euclidean metrics or scalar-variance assumptions. This matters most in operations research settings where sampling effort is a key design variable.

Key insights

Mini-batch noise in SGD has an intrinsic, loss-induced matrix geometry; it is not a scalar variance.
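
The matrix structure of mini-batch noise can be checked empirically on a toy problem. The sketch below is not the paper's experiment: it uses hypothetical synthetic least-squares data, samples mini-batches with replacement (one simple exchangeable scheme), and ignores any projection step. It compares the covariance of mini-batch mean gradients with the per-sample gradient covariance `C` divided by the batch size `b`, then prints the eigenvalues of `C` to show that the noise differs by direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem (hypothetical data, for illustration only).
n, d = 2000, 3
X = rng.normal(size=(n, d)) * np.array([1.0, 2.0, 0.5])  # anisotropic features
theta_true = np.array([1.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(size=n)

# Per-sample gradients of 0.5 * (x @ theta - y)**2, evaluated at theta_true.
residual = X @ theta_true - y
grads = residual[:, None] * X                    # shape (n, d)
C = np.cov(grads, rowvar=False, bias=True)       # per-sample gradient covariance


def minibatch_cov(b, n_batches=20000):
    """Empirical covariance of mini-batch mean gradients (with replacement)."""
    idx = rng.integers(0, n, size=(n_batches, b))
    batch_means = grads[idx].mean(axis=1)        # shape (n_batches, d)
    return np.cov(batch_means, rowvar=False)


for b in (4, 16, 64):
    C_b = minibatch_cov(b)
    rel_err = np.linalg.norm(C_b - C / b) / np.linalg.norm(C / b)
    print(f"b={b:3d}  relative error vs C/b: {rel_err:.3f}")

# The noise is anisotropic: the eigenvalues of C differ across directions.
print("eigenvalues of C:", np.linalg.eigvalsh(C))
```

A single scalar variance would have to replace `C` by a multiple of the identity, which discards the spread of eigenvalues printed on the last line.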

Method

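The paper identifies the mini-batch noise covariance as $G^{\star}(\theta)/b$, where $G^{\star}(\theta)$ is the projected per-sample gradient covariance (Fisher or Godambe, depending on the loss) and $b$ is the batch size. This identification yields a diffusion approximation and an Ornstein-Uhlenbeck linearization whose stationary covariance solves a Fisher-Lyapunov equation.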
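
A minimal numerical sketch of the Lyapunov step follows. It assumes the stationary covariance $\Sigma$ solves the standard Ornstein-Uhlenbeck equation $H\Sigma + \Sigma H^{\top} = \tau G$ with $\tau = \eta/b$; the paper's exact form and normalization may differ. The function `stationary_covariance`, the matrices `F` and `G_iso`, and the numeric values are hypothetical choices for illustration. The example compares Fisher-structured noise (`G = H = F`) with an isotropic noise matrix of the same trace, which is one way to read "scalar temperature matching".

```python
import numpy as np


def stationary_covariance(H, G, eta, b):
    """Solve H @ S + S @ H.T = (eta / b) * G for S (assumed OU Lyapunov form)."""
    d = H.shape[0]
    tau = eta / b                                # effective temperature
    I = np.eye(d)
    # Row-major vec identities: vec(H S) = (H kron I) vec(S),
    # and vec(S H^T) = (I kron H) vec(S).
    A = np.kron(H, I) + np.kron(I, H)
    S = np.linalg.solve(A, (tau * G).reshape(-1)).reshape(d, d)
    return 0.5 * (S + S.T)                       # symmetrize against round-off


# Hypothetical well-specified case: curvature and noise share one matrix F.
F = np.diag([1.0, 4.0])
eta, b = 0.1, 10

S_fisher = stationary_covariance(F, F, eta, b)

# Scalar-matched alternative: isotropic noise with the same trace as F.
G_iso = (np.trace(F) / F.shape[0]) * np.eye(2)
S_iso = stationary_covariance(F, G_iso, eta, b)

print("Fisher-structured noise, diag(S):", np.diag(S_fisher))
print("Isotropic matched noise, diag(S):", np.diag(S_iso))
```

With `G = H = F` the solution is $(\tau/2)\,I$, so the iterates spread equally in every direction. The trace-matched isotropic noise gives a different covariance, with too much spread along the flat direction and too little along the sharp one.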

Best for: AI Researcher, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.