The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient
Summary
This research investigates the implicit bias of mini-batch stochastic steepest descent in multi-class classification, analyzing how batch size, momentum, and variance reduction influence max-margin behavior and convergence rates under general entry-wise and Schatten-p norms. Without momentum, the method converges to an approximate max-margin solution only when batches are large; the gap to the exact max-margin solution depends on the batch size, and the rate matches the full-batch rate. Momentum enables convergence with small batches through a batch-momentum trade-off, at the cost of slower, though dimension-free, rates. Variance reduction recovers the exact full-batch implicit bias for any batch size, again at a slower convergence rate. In contrast, batch-size-one steepest descent without momentum converges to a fundamentally different bias, driven by sample averaging rather than max-margin geometry.
Key takeaway
For machine learning engineers training large-scale models, the choice of batch size, momentum, and variance reduction determines which solution the optimizer is biased toward, not only how fast it gets there. If you are using small batches, momentum (e.g., β₁=0.99) is essential to reach max-margin solutions, even though it slows convergence. To recover the exact full-batch implicit bias at any batch size, variance reduction is a robust option, though it also incurs a slower rate. Be aware that per-sample (batch-size-one) updates without momentum can lead to a fundamentally different bias, driven by sample averaging rather than max-margin geometry.
Key insights
Batch size, momentum, and variance reduction critically shape the implicit bias of stochastic steepest descent.
Principles
- Momentum stabilizes mini-batch noise for small-batch convergence.
- Variance reduction restores full-batch implicit bias regardless of batch size.
- Batch-size-one updates without momentum can converge to a different implicit bias, driven by sample averaging rather than max-margin geometry.
Method
The study uses a unified steepest descent framework for multi-class classification with cross-entropy or exponential loss, analyzing EMA momentum and SVR-like variance reduction.
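As a rough illustration of what a steepest descent step with EMA momentum can look like, here is a minimal sketch for entry-wise norms on a flattened weight vector. It is not taken from the paper: the function names `steepest_direction` and `ema_steepest_step`, the momentum form m ← β·m + (1−β)·g, and the restriction to the ℓ∞, ℓ2, and ℓ1 cases are all assumptions made for illustration. Schatten-p norms, which act on the singular values of a weight matrix, are not covered.

```python
import math


def _sign(x):
    """Sign of x as a float: -1.0, 0.0, or 1.0."""
    return float((x > 0) - (x < 0))


def steepest_direction(g, norm="linf"):
    """Unit-norm direction d that maximizes <g, d> under the chosen norm.

    Hypothetical helper covering only three entry-wise cases:
      linf -> sign(g)                 (sign descent)
      l2   -> g / ||g||_2             (normalized gradient descent)
      l1   -> signed coordinate with the largest |g_j| (coordinate descent)
    """
    if norm == "linf":
        return [_sign(x) for x in g]
    if norm == "l2":
        n = math.sqrt(sum(x * x for x in g))
        return [x / n for x in g] if n > 0 else [0.0] * len(g)
    if norm == "l1":
        j = max(range(len(g)), key=lambda i: abs(g[i]))
        d = [0.0] * len(g)
        d[j] = _sign(g[j])
        return d
    raise ValueError(f"unsupported norm: {norm}")


def ema_steepest_step(w, m, g, lr, beta, norm="linf"):
    """One step: update the EMA of mini-batch gradients, then move along
    the steepest descent direction of that average (assumed update form)."""
    m_new = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
    d = steepest_direction(m_new, norm)
    w_new = [wi - lr * di for wi, di in zip(w, d)]
    return w_new, m_new
```

Setting `beta=0.0` reduces the step to plain mini-batch steepest descent, the momentum-free regime that the summary says needs large batches.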
In practice
- Use large batches for steepest descent without momentum.
- Employ momentum to stabilize small-batch training.
- Integrate variance reduction for exact full-batch implicit bias.
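The variance-reduction recommendation can be sketched with a control-variate gradient estimator. This assumes that the "SVR-like" scheme mentioned in the Method section resembles the familiar SVRG estimator; the paper's exact construction is not given in this summary. The names `mean_grad`, `svrg_style_gradient`, and `exp_loss_grad`, and the toy binary exponential-loss gradient, are hypothetical and used only for illustration.

```python
import math


def exp_loss_grad(w, sample):
    """Gradient of exp(-y * <w, x>) for one sample (toy binary stand-in
    for the multi-class losses discussed in the summary)."""
    x, y = sample
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y * math.exp(-margin)
    return [scale * xi for xi in x]


def mean_grad(grad_fn, w, samples):
    """Average per-sample gradient over a list of samples."""
    grads = [grad_fn(w, s) for s in samples]
    return [sum(col) / len(grads) for col in zip(*grads)]


def svrg_style_gradient(grad_fn, w, w_snap, full_grad_snap, batch):
    """Control-variate estimate: g_B(w) - g_B(w_snap) + full_grad(w_snap).

    The same mini-batch is evaluated at the current point and at a stored
    snapshot; the snapshot's full-batch gradient corrects the difference.
    """
    g_now = mean_grad(grad_fn, w, batch)
    g_old = mean_grad(grad_fn, w_snap, batch)
    return [a - b + c for a, b, c in zip(g_now, g_old, full_grad_snap)]
```

In this sketch the estimate is unbiased for the full-batch gradient at `w`, and it equals the full-batch gradient exactly when `w` coincides with the snapshot. The resulting vector would then be passed to whichever steepest descent direction rule is in use.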
Topics
- Implicit Bias
- Stochastic Gradient Descent
- Mini-batch Optimization
- Momentum
- Variance Reduction
- Max-Margin Classification
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.