Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron
Summary
NVIDIA provides comprehensive support for higher-order optimization algorithms like Muon (MomentUm Orthogonalized by Newton-Schulz) for training large language models (LLMs) at scale. Muon has been successfully applied to models such as Kimi K2 and GLM-5. Benchmarking on the NVIDIA GB300 NVL72 system using NVIDIA NeMo Megatron Bridge 26.02 showed that Muon achieved training throughput nearly on par with the AdamW optimizer for Kimi K2 and Qwen3 30B models. Key enabling technologies include a layer-wise distributed optimizer, which assigns entire layers to individual GPUs to facilitate full-layer preconditioning, and distributed Newton-Schulz methods (duplicated, distributed, and blockwise modes) to handle tensor parallelism challenges. Additional optimizations like communication hiding, load balancing, and fused SYRK/all-reduce kernels are under development to further enhance performance.
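For readers unfamiliar with what "momentum orthogonalized by Newton-Schulz" means, the update can be written out roughly as below. This follows the publicly described Muon algorithm, not anything stated in this summary. The symbols are assumptions chosen for illustration: momentum coefficient \mu, learning rate \eta, gradient G_t, polynomial coefficients a, b, c, and iteration count K. Real implementations differ in details such as Nesterov momentum and per-layer scaling.

```latex
\begin{aligned}
M_t     &= \mu\, M_{t-1} + G_t                       && \text{(momentum buffer)} \\
X_0     &= M_t / \lVert M_t \rVert_F                 && \text{(normalize so singular values are at most 1)} \\
A_k     &= X_k X_k^{\top}                            && \text{(Gram matrix)} \\
X_{k+1} &= a\, X_k + \left(b\, A_k + c\, A_k^{2}\right) X_k, \quad k = 0, \dots, K-1 && \text{(Newton-Schulz step)} \\
W_t     &= W_{t-1} - \eta\, X_K                      && \text{(apply the approximately orthogonalized update)}
\end{aligned}
```

Because A_k is built from the whole weight matrix, the iteration needs every element of a layer at once, which is why the distribution schemes described here operate on full layers.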
Key takeaway
For AI Architects and Machine Learning Engineers deploying large-scale LLM training, NVIDIA's support for higher-order optimizers like Muon in Megatron Core offers near-AdamW throughput on GB300 systems. Consider integrating these optimizers, together with layer-wise distribution and the distributed Newton-Schulz (NS) modes, to potentially improve training efficiency. Choose between the duplicated and distributed NS modes according to whether your bottleneck is the network or computation.
Key insights
NVIDIA enables large-scale LLM training with higher-order optimizers like Muon through specialized distributed computing techniques.
Principles
- Layer-wise distribution supports full-layer preconditioning.
- Distributed Newton-Schulz handles tensor parallelism.
- Optimizers balance generality, throughput, and complexity.
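The second principle can be illustrated with a toy, pure-Python sketch. It assumes a weight matrix split column-wise across hypothetical tensor-parallel ranks. Each rank computes a partial Gram matrix from its shard, the partial results are summed (standing in for an all-reduce), and each rank then updates only its own columns. This is one plausible reading of a "distributed" mode, not NVIDIA's implementation. The function names are invented for illustration, and the coefficients are the ones used in the public Muon reference code.

```python
# Illustrative sketch only: matrices are lists of rows, "ranks" are list entries.
# Assumes rows <= columns; real code typically transposes tall matrices first.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315  # from the public Muon reference code

def matmul(p, q):
    q_cols = list(zip(*q))
    return [[sum(a * b for a, b in zip(row, col)) for col in q_cols] for row in p]

def transpose(p):
    return [list(col) for col in zip(*p)]

def add(p, q):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(p, q)]

def scale(p, s):
    return [[s * a for a in row] for row in p]

def sq_norm(p):
    return sum(a * a for row in p for a in row)

def ns_full(g, steps=5):
    """Newton-Schulz iteration on the whole matrix, as a single device would run it."""
    x = scale(g, 1.0 / (sq_norm(g) ** 0.5 + 1e-7))
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        poly = add(scale(gram, B_COEF), scale(matmul(gram, gram), C_COEF))
        x = add(scale(x, A_COEF), matmul(poly, x))
    return x

def ns_column_sharded(shards, steps=5):
    """Same iteration with the matrix split column-wise across hypothetical ranks."""
    total = sum(sq_norm(s) for s in shards)  # simulated all-reduce of squared norms
    xs = [scale(s, 1.0 / (total ** 0.5 + 1e-7)) for s in shards]
    for _ in range(steps):
        partials = [matmul(x, transpose(x)) for x in xs]  # local Gram per rank
        gram = partials[0]
        for p in partials[1:]:  # simulated all-reduce (sum) of the partial Grams
            gram = add(gram, p)
        poly = add(scale(gram, B_COEF), scale(matmul(gram, gram), C_COEF))
        xs = [add(scale(x, A_COEF), matmul(poly, x)) for x in xs]  # local update
    return xs

def split_columns(m, parts):
    width = len(m[0]) // parts  # assumes the column count divides evenly
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def join_columns(shards):
    return [sum((s[r] for s in shards), []) for r in range(len(shards[0]))]

g = [[1.0, 2.0, 0.5, -1.0], [0.0, 1.0, -2.0, 3.0]]
full = ns_full(g)
sharded = join_columns(ns_column_sharded(split_columns(g, 2)))
print(max(abs(a - b) for r, s in zip(full, sharded) for a, b in zip(r, s)))
```

The sharded result matches the full-matrix result up to floating-point rounding, at the cost of one small reduction per iteration. A duplicated mode would instead avoid those per-iteration reductions by having every rank repeat the full computation.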
Method
The approach partitions optimizer states layer by layer: gradients are reduce-scattered, each GPU updates the layers it owns locally, and the updated parameters are all-gathered. Under tensor parallelism, one of the distributed Newton-Schulz modes (duplicated, distributed, or blockwise) performs the orthogonalization.
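The three phases above can be sketched in plain Python with simulated ranks. Everything here is an assumption made for illustration: the function names, the greedy largest-first balancing rule, and the plain SGD step that stands in for the real preconditioned update. It does not reflect Megatron Core's API.

```python
import heapq

def assign_layers(layer_sizes, world_size):
    """Assign whole layers to ranks, largest first, always to the least-loaded rank."""
    heap = [(0, rank) for rank in range(world_size)]
    heapq.heapify(heap)
    owner = {}
    for name, size in sorted(layer_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, rank = heapq.heappop(heap)
        owner[name] = rank
        heapq.heappush(heap, (load + size, rank))
    return owner

def simulated_step(params, grads_per_rank, owner, lr=0.1):
    """One optimizer step with layer-wise ownership, using lists in place of GPUs."""
    world_size = len(grads_per_rank)

    # Phase 1: reduce gradients so each owner holds the averaged gradient
    # for the full layers it is responsible for.
    owned_grads = [dict() for _ in range(world_size)]
    for name, rank in owner.items():
        per_rank = [g[name] for g in grads_per_rank]
        owned_grads[rank][name] = [sum(vals) / world_size for vals in zip(*per_rank)]

    # Phase 2: each owner updates its layers locally. Plain SGD is a placeholder
    # here; a Muon-style optimizer would precondition the full layer at this point.
    updated_shards = []
    for rank in range(world_size):
        shard = {
            name: [w - lr * g for w, g in zip(params[name], grad)]
            for name, grad in owned_grads[rank].items()
        }
        updated_shards.append(shard)

    # Phase 3: all-gather so every rank ends up with all updated parameters.
    new_params = {}
    for shard in updated_shards:
        new_params.update(shard)
    return new_params

sizes = {"attn": 4, "mlp": 8, "embed": 6, "head": 2}  # hypothetical layer sizes
print(assign_layers(sizes, world_size=2))
```

Because each rank owns complete layers rather than slices of a flattened buffer, the local update in phase 2 sees an entire weight matrix, which is what full-layer preconditioning requires.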
In practice
- Use the duplicated Newton-Schulz (NS) mode when network latency is the bottleneck.
- Use the distributed NS mode when computation is the bottleneck.
- Explore MOP and REKLS in NVIDIA Emerging Optimizers.
Topics
- LLM Training Optimization
- Muon Optimizer
- NVIDIA Megatron Core
- Distributed Optimizers
- Newton-Schulz Iteration
Best for: AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.