The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient

2026-06-18 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper studies the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape max-margin behavior and convergence rates under general entry-wise and Schatten-p norms. Without momentum, convergence to an approximate max-margin solution requires large batches; the margin gap depends on the batch size, while the rate matches the full-batch rate. Momentum enables small-batch convergence through a batch-momentum trade-off, with slower but dimension-free rates. Variance reduction recovers the exact full-batch implicit bias for any batch size, also at a slower rate. By contrast, batch-size-one steepest descent without momentum converges to a fundamentally different bias, driven by sample averaging rather than max-margin geometry.

Key takeaway

For machine learning engineers training large-scale models, batch size, momentum, and variance reduction jointly determine which solution stochastic steepest descent converges to. With small batches, momentum (e.g., β₁ = 0.99) is needed to reach a max-margin solution, at the cost of slower convergence. Variance reduction recovers the exact full-batch implicit bias at any batch size, also at a slower rate. Per-sample updates without momentum can converge to a fundamentally different bias, driven by sample averaging rather than max-margin geometry.

Key insights

Batch size, momentum, and variance reduction critically shape the implicit bias of stochastic steepest descent.

Principles

Momentum stabilizes mini-batch noise for small-batch convergence.
Variance reduction restores full-batch implicit bias regardless of batch size.
Batch-size-one updates without momentum can converge to a different implicit bias, driven by sample averaging rather than max-margin geometry.
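
For readers who want the objects behind these principles written out, the block below gives a textbook formulation of stochastic steepest descent with EMA momentum and of the max-margin value for a linear multi-class model. The notation (W_t, B_t, G_t, M_t, η_t, γ*) is assumed here for illustration and is not taken from the paper, which may use different symbols or normalizations.

```latex
\begin{aligned}
G_t &= \frac{1}{|B_t|}\sum_{i\in B_t}\nabla \ell_i(W_t)
  && \text{(mini-batch gradient on batch } B_t\text{)}\\
M_t &= \beta_1 M_{t-1} + (1-\beta_1)\,G_t
  && \text{(EMA momentum; } \beta_1 = 0 \text{ means no momentum)}\\
\Delta_t &\in \arg\max_{\|\Delta\|\le 1}\,\langle M_t,\Delta\rangle,
  \qquad W_{t+1} = W_t - \eta_t\,\Delta_t
  && \text{(steepest descent step under } \|\cdot\|\text{)}\\
\gamma^\star &= \max_{\|W\|\le 1}\ \min_{i}\ \min_{k\ne y_i}\ (e_{y_i}-e_k)^\top W x_i
  && \text{(max margin under the same norm)}
\end{aligned}
```

Under this reading, "implicit bias toward max-margin" means that the normalized iterate W_t / ‖W_t‖ approaches a maximizer of the last line, and the choice of norm in the third line determines which margin is being maximized.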

Method

The study uses a unified steepest descent framework for multi-class classification with cross-entropy or exponential loss, analyzing exponential moving average (EMA) momentum and SVR-like variance reduction.
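
To make the framework concrete, here is a minimal NumPy sketch of mini-batch steepest descent with EMA momentum on a linear softmax classifier. It is not the paper's code: the function names, default hyperparameters, and the three example norms (entry-wise ℓ∞, Frobenius, and spectral, i.e. Schatten-∞) are illustrative assumptions.

```python
import numpy as np

def steepest_direction(M, norm="linf"):
    """Return a direction D with norm at most 1 that maximizes <M, D>."""
    if norm == "linf":      # entry-wise max norm: sign update
        return np.sign(M)
    if norm == "fro":       # entry-wise l2 / Schatten-2: normalized gradient
        return M / (np.linalg.norm(M) + 1e-12)
    if norm == "spectral":  # Schatten-inf: orthogonalized update U V^T
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ Vt
    raise ValueError(f"unknown norm: {norm}")

def minibatch_grad(W, X, y):
    """Gradient of the mean softmax cross-entropy for logits X @ W.T."""
    logits = X @ W.T                              # shape (n, k)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(y)), y] -= 1.0                # softmax minus one-hot
    return P.T @ X / len(y)                       # shape (k, d)

def train(X, y, k, norm="linf", batch=4, beta1=0.9, lr=0.01, steps=500, seed=0):
    """Hypothetical training loop: EMA momentum, then a steepest descent step."""
    rng = np.random.default_rng(seed)
    W = np.zeros((k, X.shape[1]))
    M = np.zeros_like(W)
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        G = minibatch_grad(W, X[idx], y[idx])
        M = beta1 * M + (1.0 - beta1) * G         # beta1 = 0 disables momentum
        W -= lr * steepest_direction(M, norm)
    return W
```

Setting beta1=0.0 and batch=1 in this sketch gives the per-sample, momentum-free regime that the summary singles out as having a different bias; raising beta1 or batch moves toward the regimes where max-margin behavior is reported.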

In practice

Use large batches for steepest descent without momentum.
Employ momentum to stabilize small-batch training.
Add variance reduction to recover the exact full-batch implicit bias at any batch size.
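
As one way to picture the variance-reduction recommendation, the sketch below uses the classic SVRG control-variate estimator as an assumed stand-in; the source only says "SVR-like", so the paper's exact estimator and schedule may differ. All names and hyperparameters are hypothetical, and the sign update is just one example norm.

```python
import numpy as np

def softmax_grad(W, X, y):
    """Gradient of the mean softmax cross-entropy for logits X @ W.T."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(y)), y] -= 1.0                # softmax minus one-hot
    return P.T @ X / len(y)

def svrg_gradient(W, W_snap, full_grad_snap, Xb, yb):
    """Batch gradient at W, corrected by the same batch's gradient at a snapshot."""
    return softmax_grad(W, Xb, yb) - softmax_grad(W_snap, Xb, yb) + full_grad_snap

def train_svrg_sign(X, y, k, batch=1, lr=0.01, epochs=5, inner=50, seed=0):
    """Hypothetical loop: SVRG-style estimator fed into an l_inf (sign) step."""
    rng = np.random.default_rng(seed)
    W = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        W_snap = W.copy()
        mu = softmax_grad(W_snap, X, y)           # full gradient, once per snapshot
        for _ in range(inner):
            idx = rng.choice(len(y), size=batch, replace=False)
            G = svrg_gradient(W, W_snap, mu, X[idx], y[idx])
            W -= lr * np.sign(G)
    return W
```

The estimator stays unbiased for the full gradient at W while its variance shrinks as W approaches the snapshot, which is the intuition for why the update direction can track the full-batch one even at batch size one. The price is an extra full-data gradient pass per snapshot.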

Topics

Implicit Bias
Stochastic Gradient Descent
Mini-batch Optimization
Momentum
Variance Reduction
Max-Margin Classification

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.