From SGD to Muon: Adaptive Optimization via Schatten-p Norms
Summary
This work introduces a data-driven criterion for dynamically selecting proxy-optimal Linear Minimization Oracle (LMO) geometries for individual layers of a deep neural network. The criterion is derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, and it navigates a design space that interpolates between SGD and Muon updates. With parameter-wise preconditioning integrated, the framework recovers SGD, Muon, Adam, and MuAdam as specific extrema. The approach scales with only a ~3% runtime overhead on highly optimized baselines. Proof-of-concept results show the data-driven optimizer beating or remaining competitive with the better of Muon and AdamW across three distinct training scenarios.
Key takeaway
For Machine Learning Engineers designing or selecting optimizers, this work suggests that adapting LMO geometry per layer during training is a promising alternative to static update rules. Data-driven adaptive techniques, such as the proposed interpolation between SGD and Muon, are worth exploring: in the reported experiments they match or beat established methods like AdamW at a runtime overhead of roughly 3%.
Key insights
Selecting each layer's LMO geometry from runtime gradient and activation statistics lets a single optimizer match or beat the better of Muon and AdamW in the three reported training scenarios.
Principles
- LMO geometry can be successfully adapted from runtime data.
- Adaptive optimizers can unify diverse update rules.
Method
A data-driven criterion, derived from gradient and activation statistics via a single-step random feature regression surrogate model, dynamically chooses LMO geometries.
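To make the design space concrete, here is a minimal NumPy sketch of a steepest-descent direction under a Schatten-p norm constraint. It assumes, as the title suggests, that the interpolation is indexed by p, with p = 2 (Frobenius norm) giving a normalized SGD step and p = infinity (spectral norm) giving a Muon-style orthogonalized step. The function name `schatten_p_direction` and the argument `p` are hypothetical, the exact SVD stands in for the Newton-Schulz iteration that practical Muon implementations typically use, and the paper's data-driven rule for choosing p per layer is not reproduced here.

```python
import numpy as np


def schatten_p_direction(grad, p):
    """Return the matrix X with Schatten-p norm 1 that maximizes <grad, X>.

    The LMO itself is the negation of this matrix; a descent step would be
    W <- W - lr * schatten_p_direction(grad, p).
    """
    if p <= 1:
        raise ValueError("this sketch only covers 1 < p <= inf")
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    if np.isinf(p):
        # Spectral-norm ball: every singular value becomes 1 (orthogonalization).
        t = np.ones_like(s)
    else:
        q = p / (p - 1.0)                 # Hoelder conjugate, 1/p + 1/q = 1
        t = s ** (q - 1.0)                # Hoelder equality: t_i proportional to s_i^(q-1)
        t = t / np.linalg.norm(t, ord=p)  # rescale so the Schatten-p norm is 1
    return U @ (t[:, None] * Vt)


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))

d_sgd = schatten_p_direction(G, 2.0)      # equals G / ||G||_F
d_muon = schatten_p_direction(G, np.inf)  # equals U @ V^T
d_mid = schatten_p_direction(G, 4.0)      # an intermediate geometry
```

Intermediate values of p flatten the singular-value spectrum of the update progressively, which is one way to read the claim that the two optimizers sit at the ends of a single family.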
In practice
- Adds only ~3% runtime overhead relative to highly optimized baselines.
- Recovers SGD, Muon, Adam, and MuAdam as specific extrema.
Topics
- Adaptive Optimization
- Schatten-p Norms
- LMO Theory
- Deep Neural Networks
- SGD
- Muon
- AdamW
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.