The First Optimizer to Challenge Adam in a Decade Just Cut Training Costs in Half

· Source: AI Advances - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

Muon, a new optimizer, has emerged as the first significant challenger to Adam's decade-long dominance, demonstrating roughly 2x compute efficiency over AdamW in compute-optimal training. Its enhanced variant, NorMuon, further improves efficiency by 21.74% on a 1.1B-parameter model. Several large-scale production models, including Kimi K2 (1T parameters), GLM-4.5 (355B), and INTELLECT-3 (106B), already adopted Muon in 2025. The optimizer is slated for native inclusion in PyTorch 2.9, with integration into DeepSpeed and NVIDIA NeMo to follow. More than 15 variants have appeared within 18 months, a sign of how quickly the ecosystem is developing.

Key takeaway

AI architects and NLP engineers who train large language models should evaluate Muon or NorMuon for their workflows. If the reported 2x compute efficiency over AdamW carries over to your setup, it could roughly halve training costs and shorten development cycles for large models such as Kimi K2 or GLM-4.5. Prioritize testing Muon on your own architectures, especially once it becomes natively available in PyTorch 2.9.
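
The link between a 2x compute-efficiency figure and halved cost is simple arithmetic, sketched below in Python. The function names and the 100,000 GPU-hour baseline are hypothetical and chosen only for illustration. The sketch also assumes that cost scales linearly with compute and that the reported gain transfers unchanged to your model, neither of which the source guarantees.

```python
def estimated_cost(baseline_cost: float, efficiency_gain: float) -> float:
    """Estimate the cost of reaching the same result with a more efficient optimizer.

    efficiency_gain is a multiplier: 2.0 means the same result for half the compute.
    Assumes cost is directly proportional to compute (an illustrative simplification).
    """
    if efficiency_gain <= 0:
        raise ValueError("efficiency_gain must be positive")
    return baseline_cost / efficiency_gain


def savings_fraction(efficiency_gain: float) -> float:
    """Fraction of the baseline cost saved for a given efficiency multiplier."""
    if efficiency_gain <= 0:
        raise ValueError("efficiency_gain must be positive")
    return 1.0 - 1.0 / efficiency_gain


# Hypothetical AdamW baseline budget, in GPU-hours.
baseline_gpu_hours = 100_000.0

muon_gpu_hours = estimated_cost(baseline_gpu_hours, 2.0)
print(muon_gpu_hours)         # GPU-hours needed at a 2x efficiency gain
print(savings_fraction(2.0))  # share of the baseline budget saved
```

A smaller measured gain shrinks the saving quickly. A 1.5x multiplier, for example, saves one third of the budget, not half, which is why measuring the gain on your own architecture matters before budgeting around the headline number.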

Key insights

Muon optimizer offers significant compute efficiency gains, challenging Adam's long-standing dominance in deep learning.

Best for: NLP Engineer, Computer Vision Engineer, AI Architect, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.