From SGD to Muon: Adaptive Optimization via Schatten-p Norms

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A novel adaptive optimizer introduces a data-driven criterion for dynamically selecting a proxy-optimal Linear Minimization Oracle (LMO) geometry for each layer of a deep neural network. The criterion is derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, and it navigates a design space that interpolates between SGD and Muon updates. With parameter-wise preconditioning added, the framework recovers SGD, Muon, Adam, and MuAdam as extreme cases. The approach scales, adding only about 3% runtime overhead on highly optimized baselines. In proof-of-concept experiments across three distinct training scenarios, the data-driven optimizer beats or remains competitive with the better of Muon and AdamW.
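
To make the interpolation between SGD and Muon concrete, one standard way to write an LMO over a Schatten-p unit ball is sketched below. The notation is assumed here and is not taken from the original article: G is a layer's gradient matrix with singular value decomposition G = U diag(σ) Vᵀ, and q is the exponent dual to p. The article's exact parameterization may differ.

```latex
\[
\operatorname{LMO}_p(G)
  = \arg\min_{\|X\|_{S_p} \le 1} \langle G, X \rangle
  = -\,U \operatorname{diag}\!\left( \frac{\sigma_i^{\,q-1}}{\|\sigma\|_q^{\,q-1}} \right) V^\top,
\qquad \frac{1}{p} + \frac{1}{q} = 1 .
\]
\[
p = 2:\ \operatorname{LMO}_2(G) = -\frac{G}{\|G\|_F}
\quad \text{(normalized gradient step)},
\qquad
p = \infty:\ \operatorname{LMO}_\infty(G) = -\,U V^\top
\quad \text{(orthogonalized step)} .
\]
```

The closed form follows from Hölder's inequality for Schatten norms. The p = 2 endpoint is an SGD step up to scaling. The p = ∞ endpoint (taking σ_i^0 = 1 on the nonzero singular values) is the kind of orthogonalized direction that Muon approximates with Newton-Schulz iterations, applied there to a momentum buffer rather than the raw gradient. Intermediate values of p partially flatten the singular value spectrum.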

Key takeaway

For Machine Learning Engineers designing or selecting optimizers, this work suggests that adapting the LMO geometry dynamically, layer by layer, is a promising path beyond static update rules. Consider exploring data-driven adaptive optimizers such as the one proposed here, which may match or outperform established methods like AdamW at minimal runtime overhead.

Key insights

Adapting each layer's LMO geometry from runtime gradient and activation statistics can match or beat the better of Muon and AdamW, at roughly 3% runtime overhead.

Principles

Method

A data-driven criterion, derived in closed form from gradient and activation statistics via a single-step random feature regression surrogate model, chooses an LMO geometry for each layer dynamically during training.
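
The closed-form criterion itself is not spelled out in this summary, so the sketch below does not attempt to reproduce it. It only illustrates what choosing an LMO geometry per layer could look like, assuming the geometry is a Schatten-p unit ball and that p is supplied by hand. The function name `schatten_lmo` and the choice of p values are hypothetical, not from the original article.

```python
import numpy as np

def schatten_lmo(grad, p):
    """Return argmin over the Schatten-p unit ball of <grad, X>, for 1 < p <= inf.

    Hypothetical helper: p = 2 gives the normalized gradient direction,
    p = inf gives the orthogonalized direction -U V^T.
    """
    grad = np.asarray(grad, dtype=float)
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if s.max() == 0.0:
        return np.zeros_like(grad)
    if np.isinf(p):
        # Dual exponent q = 1: every nonzero singular value maps to 1.
        x = (s > 1e-12 * s.max()).astype(float)
    else:
        q = p / (p - 1.0)      # dual exponent, 1/p + 1/q = 1
        t = s / s.max()        # rescale for stability; the result is scale-invariant
        x = t ** (q - 1.0) / np.sum(t ** q) ** ((q - 1.0) / q)
    # Scale the columns of U by the new singular values, flip sign to minimize.
    return -(u * x) @ vt

# Usage: a different (hand-picked, illustrative) p for each layer's gradient.
rng = np.random.default_rng(0)
layer_grads = {"embed": rng.standard_normal((8, 4)), "mlp": rng.standard_normal((4, 4))}
layer_p = {"embed": 2.0, "mlp": np.inf}
directions = {name: schatten_lmo(g, layer_p[name]) for name, g in layer_grads.items()}
```

A weight update would then add a step size times the returned direction to the layer's weights. In the article's method, the per-layer geometry is chosen by the data-driven criterion rather than fixed by hand as it is here.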

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.