Zeroth-Order Optimization at the Edge of Stability
Summary
A new study provides an explicit step size condition for the mean-square linear stability of zeroth-order (ZO) optimization methods based on the standard two-point estimator. It shows that ZO stability is governed by the entire Hessian spectrum, in sharp contrast to first-order (FO) methods, where stability depends only on the largest Hessian eigenvalue. Because computing the full Hessian spectrum is impractical for neural networks, the authors derive tractable stability bounds that rely solely on the largest eigenvalue and the Hessian trace. Empirically, full-batch ZO methods such as ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted boundary across various deep learning training tasks, which points to an implicit regularization effect: large ZO step sizes primarily regularize the Hessian trace.
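For readers unfamiliar with the two-point estimator mentioned above, the following is a minimal sketch of one common variant, not the paper's code. It assumes Gaussian random directions and a central difference; the function names, the smoothing radius `mu`, and the toy quadratic are all illustrative choices, and the study's exact estimator may differ in these details.

```python
import numpy as np

def two_point_grad(f, x, rng, mu=1e-4):
    """Estimate the gradient of f at x from two function evaluations.

    Assumption: Gaussian direction and central difference (hypothetical
    choices for illustration; other two-point variants exist).
    """
    u = rng.standard_normal(x.shape)
    # Finite-difference slope of f along the random direction u.
    slope = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
    # Project the slope back onto u to get a gradient estimate.
    return slope * u

def zo_gd_step(f, x, step_size, rng, mu=1e-4):
    """One ZO-GD update: a gradient step using the two-point estimate."""
    return x - step_size * two_point_grad(f, x, rng, mu)

# Toy quadratic with a known Hessian (illustrative values only).
H = np.diag([1.0, 0.5, 0.1])

def quadratic(x):
    return 0.5 * x @ H @ x

rng = np.random.default_rng(0)
x = np.ones(3)
initial_loss = quadratic(x)
for _ in range(2000):
    x = zo_gd_step(quadratic, x, step_size=0.05, rng=rng)
final_loss = quadratic(x)
```

The update uses only function values, never an analytic gradient, which is why the randomness of the direction `u` enters the stability analysis.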
Key takeaway
For research scientists optimizing large models with zeroth-order methods, the critical point is that ZO stability is tied to the full Hessian spectrum, not just the top eigenvalue. The derived tractable stability bounds, which depend on the largest eigenvalue and the Hessian trace, can guide step size selection. The results also suggest that larger ZO step sizes implicitly regularize the Hessian trace, an optimization dynamic that differs from first-order approaches.
Key insights
Zeroth-order optimization stability depends on the full Hessian spectrum, unlike first-order methods.
Principles
- ZO stability depends on the entire Hessian spectrum.
- Large ZO step sizes regularize the Hessian trace.
Method
The study derives tractable stability bounds for ZO methods using only the largest Hessian eigenvalue and the Hessian trace, avoiding computation of the full spectrum.
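The two quantities these bounds depend on can be estimated without forming the Hessian. The sketch below is an assumption-laden illustration, not the authors' procedure: it uses power iteration for the largest eigenvalue and a Hutchinson-style estimator for the trace, both driven by a Hessian-vector product `hvp`. Here `hvp` is a plain matrix product on a small synthetic matrix; in a neural network it would typically come from automatic differentiation. The bounds themselves are not reproduced here.

```python
import numpy as np

def top_eigenvalue(hvp, dim, rng, iters=100):
    """Power iteration on a Hessian-vector product.

    Assumes the largest-magnitude eigenvalue is the largest eigenvalue
    (true for the positive semidefinite toy matrix below).
    """
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    # Rayleigh quotient of the converged unit vector.
    return float(v @ hvp(v))

def hutchinson_trace(hvp, dim, rng, samples=2000):
    """Stochastic trace estimate: average of z^T H z over random sign vectors z."""
    total = 0.0
    for _ in range(samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ hvp(z)
    return total / samples

# Synthetic symmetric matrix with a known spectrum (illustrative values only).
rng = np.random.default_rng(0)
eigs = np.array([4.0, 1.0, 0.5, 0.25, 0.25])
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
H = Q @ np.diag(eigs) @ Q.T

def hvp(v):
    return H @ v

lam_max = top_eigenvalue(hvp, 5, rng)
trace_est = hutchinson_trace(hvp, 5, rng)
```

Both routines need only matrix-vector products, so their cost scales with the number of products rather than with the size of the full spectrum. The trace estimate is noisy and tightens as `samples` grows.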
In practice
- ZO methods operate at the edge of stability.
- ZO-GD, ZO-GDM, ZO-Adam stabilize near predicted bounds.
Topics
- Zeroth-Order Optimization
- Edge of Stability
- Hessian Spectrum
- Step Size Condition
- Implicit Regularization
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.