Zeroth-Order Optimization at the Edge of Stability

Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This work studies the optimization dynamics of zeroth-order (ZO) methods in deep learning, specifically ZO-GD, ZO-GDM, and ZO-Adam, with a focus on their mean-square linear stability. For first-order (FO) methods, stability is governed by the largest Hessian eigenvalue; for ZO methods, mean-square stability depends on the entire Hessian spectrum. The authors derive explicit step-size conditions, along with tractable stability bounds that involve only the Hessian trace and the largest eigenvalue. Empirically, full-batch ZO methods consistently operate at the "edge of stability" (EoS): they stabilize near the predicted mean-square stability boundary across deep learning tasks and architectures, including CNNs, ResNets, and Vision Transformers on CIFAR-10, and LSTM and Mamba models on a synthetic sorting task. This behavior is driven primarily by trace-based curvature quantities, which points to an implicit regularization effect: large step sizes regularize the Hessian trace in ZO methods, whereas in FO methods they regularize the top eigenvalue.
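
To make the object of study concrete, here is a minimal sketch of ZO-GD. It assumes a common two-point Gaussian estimator, which this summary does not confirm is the paper's exact choice. The names `zo_gradient` and `zo_gd`, the smoothing radius `mu`, and the toy quadratic are all hypothetical and chosen only for illustration.

```python
import numpy as np


def zo_gradient(f, x, mu, rng):
    """Two-point ZO estimate along one random Gaussian direction.

    Assumption: g = [(f(x + mu*u) - f(x - mu*u)) / (2*mu)] * u with u ~ N(0, I).
    Only function evaluations are used, never a true gradient.
    """
    u = rng.standard_normal(x.shape)
    slope = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
    return slope * u, u


def zo_gd(f, x0, eta, steps, mu=1e-3, seed=0):
    """Plain ZO-GD: x <- x - eta * g, with g from zo_gradient."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g, _ = zo_gradient(f, x, mu, rng)
        x = x - eta * g
    return x


# Toy quadratic f(x) = 0.5 * x^T H x (illustrative, not from the paper).
H = np.diag([4.0, 1.0, 0.25])


def quadratic(x):
    return 0.5 * x @ H @ x


x_final = zo_gd(quadratic, np.ones(3), eta=0.05, steps=2000)
print("final distance to the minimizer:", np.linalg.norm(x_final))
```

On a quadratic the central difference is exact, so each step applies the random matrix I − η u uᵀ H to the iterate. The randomness of that matrix is why stability has to be stated in mean square rather than per trajectory.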

Key takeaway

For research scientists working with large models or black-box optimization, the mean-square edge of stability of ZO methods bears directly on hyperparameter tuning. Step size and momentum should be chosen with the Hessian trace in mind, since it plays the dominant role in stability, rather than the top eigenvalue alone. In particular, increasing momentum in ZO methods may shrink the stable region, the opposite of its effect in FO methods, so common optimization heuristics need careful re-evaluation.

Key insights

Zeroth-order optimization stability depends on the full Hessian spectrum, not just the largest eigenvalue.
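
A short worked example shows why the whole spectrum enters. The setting is a simplifying assumption made here, not taken from the paper: a quadratic loss f(x) = ½ xᵀHx with positive definite H, and ZO-GD with a single Gaussian direction u ~ N(0, I) per step and vanishing smoothing, so that x_{t+1} = (I − η u uᵀ H) x_t. The paper's estimator and constants may differ. Let s_{t,i} denote the second moment of the iterate along the i-th Hessian eigenvector, and η* the largest mean-square stable step size. The Gaussian fourth-moment identity E[u uᵀ M u uᵀ] = 2M + tr(M) I for symmetric M then gives:

```latex
\begin{align*}
s_{t+1,i} &= \bigl(1 - 2\eta\lambda_i + 2\eta^2\lambda_i^2\bigr)\, s_{t,i}
            + \eta^2 \sum_{j=1}^{d} \lambda_j^2\, s_{t,j}, \\[4pt]
\text{mean-square stable} &\iff \eta\lambda_{\max} < 1
            \quad\text{and}\quad \sum_{j=1}^{d} \frac{\eta\lambda_j}{1 - \eta\lambda_j} < 2, \\[4pt]
\frac{2}{\operatorname{tr}(H) + 2\lambda_{\max}} &\le \eta^{*} \le \frac{2}{\operatorname{tr}(H)}.
\end{align*}
```

The coupling term sums over every eigenvalue, so under these assumptions the critical step size is pinned between two quantities set by tr(H) and λ_max. Gradient descent on the same quadratic, by contrast, is stable for any η < 2/λ_max regardless of the other eigenvalues.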

Method

The authors develop a mean-square linear stability theory for ZO methods by analyzing second-moment recursions with cone-preserving linear operators and the Krein–Rutman theorem.
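
The idea of reading stability off a cone-preserving operator can be illustrated numerically in a simplified setting. This sketch is not the paper's derivation and rests on several assumptions: a quadratic loss, ZO-GD with one Gaussian direction per step, and vanishing smoothing. Under them, the second moments along the Hessian eigenvectors evolve through an entrywise nonnegative matrix, the finite-dimensional case covered by Perron–Frobenius (which Krein–Rutman generalizes). Mean-square stability is then equivalent to that matrix having spectral radius below 1. The function names and the example spectrum are hypothetical.

```python
import numpy as np


def second_moment_matrix(eigs, eta):
    """Nonnegative matrix A with s_{t+1} = A @ s_t (assumed simplified model).

    A[i, j] = (1 - 2*eta*l_i + 2*(eta*l_i)**2) * [i == j] + eta**2 * l_j**2,
    where l are the Hessian eigenvalues and s holds per-eigendirection
    second moments of the iterate.
    """
    eigs = np.asarray(eigs, dtype=float)
    diag = 1.0 - 2.0 * eta * eigs + 2.0 * (eta * eigs) ** 2
    return np.diag(diag) + eta**2 * np.outer(np.ones_like(eigs), eigs**2)


def spectral_radius(mat):
    return float(np.max(np.abs(np.linalg.eigvals(mat))))


def critical_step_size(eigs, iters=60):
    """Bisect for the largest eta whose second-moment matrix has radius < 1."""
    eigs = np.asarray(eigs, dtype=float)
    lo, hi = 0.0, 2.0 / eigs.sum()  # eta = 2 / tr(H) is already unstable
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if spectral_radius(second_moment_matrix(eigs, mid)) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative spectrum: one large eigenvalue plus a tail.
eigs = np.array([10.0, 3.0, 1.0, 0.5, 0.1])
eta_star = critical_step_size(eigs)
lower = 2.0 / (eigs.sum() + 2.0 * eigs.max())  # trace + top-eigenvalue bound
upper = 2.0 / eigs.sum()                       # trace-only bound
fo_threshold = 2.0 / eigs.max()                # classical GD threshold

print("ZO critical step size:", eta_star)
print("bounds from trace and top eigenvalue:", lower, upper)
print("FO (GD) threshold:", fo_threshold)
```

In this toy model the ZO threshold sits between the two trace-based bounds and below the FO threshold 2/λ_max, consistent with the summary's point that trace-type quantities, not the top eigenvalue alone, set the stability boundary.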

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.