Zeroth-Order Optimization at the Edge of Stability
Summary
This work investigates the optimization dynamics of zeroth-order (ZO) methods in deep learning, namely ZO-GD, ZO-GDM, and ZO-Adam, with a focus on their mean-square linear stability. For first-order (FO) methods, stability is governed by the largest Hessian eigenvalue; for ZO methods, mean-square stability depends on the entire Hessian spectrum. The authors derive explicit step-size conditions and tractable stability bounds that rely on the Hessian trace and largest eigenvalue. Empirically, full-batch ZO methods consistently operate at the "edge of stability" (EoS), stabilizing near the predicted mean-square stability boundary across CNNs, ResNets, and Vision Transformers on CIFAR-10, and across LSTM and Mamba models on a synthetic sorting task. This behavior is driven primarily by trace-based curvature quantities, which points to an implicit regularization effect: large step sizes regularize the Hessian trace in ZO methods, whereas in FO methods they regularize the top eigenvalue.
Key takeaway
For research scientists working with large models or black-box optimization, understanding the mean-square edge of stability of ZO methods is crucial. Hyperparameter tuning, especially of step size and momentum, should account for the dominant role of the Hessian trace in stability rather than the top eigenvalue alone. Momentum also behaves differently: increasing it in ZO methods may shrink the stable region, the opposite of what happens in FO methods, so common optimization heuristics need careful re-evaluation.
Key insights
Zeroth-order optimization stability depends on the full Hessian spectrum, not just the largest eigenvalue.
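A toy calculation can make this concrete. The sketch below is not taken from the paper: it assumes a quadratic loss f(x) = ½ xᵀHx, Gaussian directions u ~ N(0, I), and the central-difference estimator g = (uᵀHx)u. Under those assumptions, the second moments s_i of the iterate along each Hessian eigendirection follow a linear recursion s ← M s with M_ij = (1 − 2ηλ_i + 2η²λ_i²)δ_ij + η²λ_j², and the iteration is mean-square stable when the spectral radius of M is below 1. The function names are illustrative.

```python
import numpy as np

def zo_second_moment_matrix(eigs, eta):
    """Matrix M of the recursion s <- M s (assumed: quadratic loss, Gaussian directions)."""
    a = eta * np.asarray(eigs, dtype=float)
    # Diagonal part: how each eigendirection evolves on its own.
    own = np.diag(1.0 - 2.0 * a + 2.0 * a**2)
    # Rank-one part: estimator noise feeds every direction into all the others.
    coupling = np.outer(np.ones_like(a), a**2)
    return own + coupling

def is_mean_square_stable(eigs, eta):
    """True when the spectral radius of M is strictly below 1."""
    M = zo_second_moment_matrix(eigs, eta)
    return bool(np.max(np.abs(np.linalg.eigvals(M))) < 1.0)

# Two spectra with the same largest eigenvalue but very different traces.
flat = np.full(50, 1.0)                   # trace = 50
spiked = np.array([1.0] + [0.01] * 49)    # trace = 1.49
eta = 0.5

print(is_mean_square_stable(spiked, eta))  # True
print(is_mean_square_stable(flat, eta))    # False
```

Both spectra share a top eigenvalue of 1, so gradient descent on this quadratic would tolerate any step size below 2. In this toy setting, only the low-trace spectrum remains mean-square stable at η = 0.5.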
Principles
- Because ZO updates are inherently random, their stability is governed by mean-square dynamics.
- Momentum shrinks the stable regime for ZO methods, unlike FO methods.
- Large step sizes implicitly regularize the Hessian trace in ZO training.
Method
A mean-square linear stability theory for ZO methods is developed, analyzing second-moment recursions via cone-preserving linear operators and the Krein–Rutman Theorem.
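The flavor of this analysis can be sketched numerically under simplifying assumptions that are not the paper's general setting: plain ZO-GD on a quadratic loss with Hessian H, Gaussian directions u ~ N(0, I), and update x ← (I − η u uᵀ H) x. The second-moment matrix Σ = E[x xᵀ] then evolves under a linear operator that maps positive semidefinite matrices to positive semidefinite matrices. For such cone-preserving operators, the spectral radius is attained on the cone, so power iteration started from the identity can estimate the mean-square growth factor. The names `second_moment_operator` and `growth_rate` are hypothetical.

```python
import numpy as np

def second_moment_operator(Sigma, H, eta):
    """E[(I - eta*u u^T H) Sigma (I - eta*H u u^T)] for u ~ N(0, I) (assumed setting)."""
    HSH = H @ Sigma @ H
    # Gaussian fourth-moment identity: E[u u^T A u u^T] = 2A + tr(A) I for symmetric A.
    noise = 2.0 * HSH + np.trace(HSH) * np.eye(H.shape[0])
    return Sigma - eta * (H @ Sigma + Sigma @ H) + eta**2 * noise

def growth_rate(H, eta, iters=500):
    """Power iteration inside the PSD cone; estimates the per-step growth of E||x||^2."""
    Sigma = np.eye(H.shape[0])
    rate = 0.0
    for _ in range(iters):
        nxt = second_moment_operator(Sigma, H, eta)
        rate = np.trace(nxt) / np.trace(Sigma)
        Sigma = nxt / np.trace(nxt)  # renormalize to avoid overflow or underflow
    return rate

H = np.diag([1.0, 0.5, 0.1])
print(growth_rate(H, eta=0.2) < 1.0)  # True: mean-square stable in this toy case
```

A growth factor below 1 means E‖x‖² contracts on average, and a factor above 1 means it diverges. The boundary between the two is the mean-square stability threshold for this toy model.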
In practice
- Track Hessian trace and top eigenvalue for ZO stability assessment.
- Consider how the smoothing parameter μ affects curvature growth.
- Expect momentum to affect stability in the opposite direction in ZO methods compared with FO methods.
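One common way to track these two curvature quantities without forming the Hessian is to combine Hutchinson's trace estimator with power iteration, both of which need only Hessian-vector products. The sketch below is a diagnostic illustration, not the paper's measurement code. The `hvp` callable is a hypothetical stand-in for whatever Hessian-vector product is available (autodiff or finite differences), and an explicit small matrix plays that role here.

```python
import numpy as np

def hutchinson_trace(hvp, dim, num_probes=100, seed=0):
    """Estimate tr(H) as the average of v^T H v over Rademacher probes v."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ hvp(v)
    return total / num_probes

def top_eigenvalue(hvp, dim, iters=200, seed=0):
    """Power iteration; converges to the eigenvalue of largest magnitude."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = hvp(v)
        lam = v @ w                 # Rayleigh quotient at the current unit vector
        v = w / np.linalg.norm(w)
    return lam

# Toy Hessian standing in for a real model's curvature (assumption for illustration).
H = np.diag([4.0, 1.0, 0.5, 0.5])
hvp = lambda v: H @ v

print(hutchinson_trace(hvp, dim=4))  # 6.0 (exact here because H is diagonal)
print(top_eigenvalue(hvp, dim=4))    # approximately 4.0
```

On a real network the trace estimate is noisy and improves with more probes. Power iteration returns the largest-magnitude eigenvalue, which matches the top eigenvalue only when the most negative eigenvalue is smaller in magnitude.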
Topics
- Zeroth-Order Optimization
- Mean-Square Stability
- Edge of Stability
- Hessian Spectrum
- Implicit Regularization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.