The First Optimizer to Challenge Adam in a Decade Just Cut Training Costs in Half
Summary
Muon, a new optimizer, has emerged as the first significant challenger to Adam's decade-long dominance, demonstrating roughly 2x compute efficiency over AdamW in compute-optimal training. An enhanced variant, NorMuon, is reported to improve efficiency by a further 21.74% on a 1.1B-parameter model. Several large-scale production models, including Kimi K2 (1T parameters), GLM-4.5 (355B), and INTELLECT-3 (106B), adopted Muon in 2025. The optimizer is slated for native inclusion in PyTorch 2.9, with integration into DeepSpeed and NVIDIA NeMo to follow. More than 15 variants have appeared within 18 months, a sign of how quickly the ecosystem is developing.
Key takeaway
For AI architects and NLP engineers who train large language models, Muon and NorMuon are worth evaluating now. If the reported 2x compute efficiency over AdamW holds for your workload, it could roughly halve training costs and shorten development cycles; production models such as Kimi K2 and GLM-4.5 have already adopted it. Prioritize testing Muon on your own architectures, especially once it is natively available in PyTorch 2.9.
Key insights
The Muon optimizer offers significant compute-efficiency gains, challenging Adam's long-standing dominance in deep learning.
Principles
- Compute efficiency is paramount for large model training.
- Optimizer innovation can drastically reduce training costs.
In practice
- Adopt Muon for compute-optimal model training.
- Explore NorMuon for further efficiency gains.
- Use `torch.optim.Muon` once it ships in PyTorch 2.9.
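For readers deciding whether Muon is worth a trial, it may help to see what sets it apart from AdamW. The summary above does not describe the algorithm, so the following is only a minimal pure-Python sketch of the Newton-Schulz orthogonalization step that the publicly described Muon algorithm applies to each 2D weight update. The coefficients, the five iterations, and the helper names (`matmul`, `transpose`, `orthogonalize`) are assumptions taken from the commonly cited reference implementation or invented for illustration; `torch.optim.Muon` may differ in its details.

```python
import math

# Quintic Newton-Schulz coefficients from the public Muon reference
# implementation (an assumption here; not stated in the article).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(update, steps=5, eps=1e-7):
    """Push the singular values of a 2D update toward 1 (approximately)."""
    a, b, c = NS_COEFFS
    # Work in the wide orientation so the Gram matrix X @ X^T is the smaller one.
    tall = len(update) > len(update[0])
    x = transpose(update) if tall else [row[:] for row in update]
    # Frobenius normalization bounds every singular value by 1.
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram_sq = matmul(gram, gram)
        # poly = b * (X X^T) + c * (X X^T)^2
        poly = [[b * g + c * g2 for g, g2 in zip(gr, gr2)]
                for gr, gr2 in zip(gram, gram_sq)]
        px = matmul(poly, x)
        # X <- a * X + poly @ X
        x = [[a * v + p for v, p in zip(xr, pr)] for xr, pr in zip(x, px)]
    return transpose(x) if tall else x


grad = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # toy gradient, illustrative only
ortho = orthogonalize(grad)
gram = matmul(ortho, transpose(ortho))  # roughly the identity if rows are near-orthonormal
```

In the reference design this step is applied to a momentum-averaged gradient for hidden-layer weight matrices, while embeddings, biases, and other non-2D parameters are typically left to AdamW. Treat that split as something to verify against the PyTorch 2.9 documentation, not as a given.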
Topics
- Optimizer Algorithms
- Compute Efficiency
- Large Language Models
- Deep Learning Frameworks
- Model Training
Best for: NLP Engineer, Computer Vision Engineer, AI Architect, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.