Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron

2026-04-22 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

NVIDIA provides comprehensive support for higher-order optimization algorithms like Muon (MomentUm Orthogonalized by Newton-Schulz) for training large language models (LLMs) at scale. Muon has been successfully applied to models such as Kimi K2 and GLM-5. Benchmarking on the NVIDIA GB300 NVL72 system using NVIDIA NeMo Megatron Bridge 26.02 showed that Muon achieved training throughput nearly on par with the AdamW optimizer for Kimi K2 and Qwen3 30B models. Key enabling technologies include a layer-wise distributed optimizer, which assigns entire layers to individual GPUs to facilitate full-layer preconditioning, and distributed Newton-Schulz methods (duplicated, distributed, and blockwise modes) to handle tensor parallelism challenges. Additional optimizations like communication hiding, load balancing, and fused SYRK/all-reduce kernels are under development to further enhance performance.

Key takeaway

For AI Architects and Machine Learning Engineers deploying large-scale LLM training, NVIDIA's support for higher-order optimizers like Muon in Megatron Core offers near-AdamW throughput on GB300 systems. Consider integrating these optimizers, using layer-wise distribution and the distributed Newton-Schulz (NS) modes, to potentially improve training efficiency. Choose between the duplicated and distributed NS modes according to whether network latency or computation is your bottleneck.

Key insights

NVIDIA enables large-scale LLM training with higher-order optimizers like Muon through specialized distributed computing techniques.

Principles

Layer-wise distribution supports full-layer preconditioning.
Distributed Newton-Schulz handles tensor parallelism.
Optimizers balance generality, throughput, and complexity.
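
To make the Newton-Schulz idea concrete, here is a minimal, dependency-free sketch of the textbook cubic iteration, which drives a matrix toward its nearest semi-orthogonal factor. It is not the Megatron Core kernel: the helper names (`matmul`, `transpose`, `newton_schulz`), the step count, and the example matrix are assumptions chosen for illustration. Muon implementations commonly use a tuned higher-degree polynomial with only a few steps.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=30):
    """Approximate the semi-orthogonal factor of g (assumes rows <= cols)."""
    # Dividing by the Frobenius norm bounds every singular value by 1,
    # which keeps the iteration inside its convergence region.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        gram = matmul(x, transpose(x))  # X X^T
        gx = matmul(gram, x)            # (X X^T) X
        # Cubic Newton-Schulz step: X <- 1.5 X - 0.5 (X X^T) X
        x = [[1.5 * xv - 0.5 * gv for xv, gv in zip(xr, gr)]
             for xr, gr in zip(x, gx)]
    return x

# Hypothetical 2 x 3 "momentum" matrix used only for illustration.
grad = [[3.0, 1.0, 0.5], [1.0, 2.0, -1.0]]
ortho = newton_schulz(grad)
gram = matmul(ortho, transpose(ortho))  # close to the 2 x 2 identity
```

Every step needs the whole matrix to form `X X^T`, which is why full-layer preconditioning and tensor-parallel sharding require special handling.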

Method

The approach partitions optimizer states layer-wise: gradients are reduce-scattered, each GPU applies local updates to the layers it owns, and the updated parameters are all-gathered. Specific distributed Newton-Schulz modes handle layers that are split by tensor parallelism.
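
The layer-wise partitioning step can be sketched as a simple ownership map in which each layer's optimizer state lives on exactly one rank. The greedy largest-first policy, the function name `assign_layers`, and the layer names and sizes below are hypothetical; the source does not describe how Megatron Core actually balances layers across GPUs.

```python
import heapq

def assign_layers(layer_sizes, num_ranks):
    """Assign whole layers to ranks, largest first, onto the least-loaded rank."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    owner = {}
    for name, size in sorted(layer_sizes.items(), key=lambda kv: -kv[1]):
        load, rank = heapq.heappop(heap)   # least-loaded rank so far
        owner[name] = rank                 # the whole layer goes to one rank
        heapq.heappush(heap, (load + size, rank))
    return owner

# Hypothetical layers with relative parameter counts.
layer_sizes = {"attn.qkv": 12, "attn.out": 4, "mlp.up": 16, "mlp.down": 16, "embed": 8}
owner = assign_layers(layer_sizes, num_ranks=2)
loads = [sum(size for name, size in layer_sizes.items() if owner[name] == rank)
         for rank in range(2)]
```

Because no layer is split across owners, the owning rank sees the full gradient matrix for that layer and can run a full-layer preconditioner on it before the updated parameters are gathered back.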

In practice

Use duplicated NS mode when network latency is the bottleneck.
Use distributed NS mode when computation is the bottleneck.
Explore MOP and REKLS in NVIDIA Emerging Optimizers.
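
The trade-off between the duplicated and distributed modes can be illustrated with a single-process simulation. This is one plausible reading, not NVIDIA's implementation: it assumes the matrix is sharded by columns across tensor-parallel ranks, that "duplicated" means every rank gathers the full matrix and repeats the whole iteration, and that "distributed" means each rank keeps its shard and only the small Gram matrix is summed across ranks (an all-reduce). All function names are hypothetical, and the cubic Newton-Schulz step stands in for whatever polynomial the real kernels use.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def ns_step(x, gram):
    """One cubic Newton-Schulz step given a precomputed Gram matrix X X^T."""
    gx = matmul(gram, x)
    return [[1.5 * xv - 0.5 * gv for xv, gv in zip(xr, gr)] for xr, gr in zip(x, gx)]

def concat_columns(shards):
    """Join column shards into one matrix (stands in for an all-gather)."""
    return [sum((shard[i] for shard in shards), []) for i in range(len(shards[0]))]

def duplicated_ns(shards, steps):
    # One gather up front; every rank would then redo this full computation.
    full = concat_columns(shards)
    fro = math.sqrt(sum(v * v for row in full for v in row))
    x = [[v / fro for v in row] for row in full]
    for _ in range(steps):
        x = ns_step(x, matmul(x, transpose(x)))
    return x

def distributed_ns(shards, steps):
    # The global norm is a sum of per-shard squared norms (a scalar all-reduce).
    fro = math.sqrt(sum(v * v for shard in shards for row in shard for v in row))
    xs = [[[v / fro for v in row] for row in shard] for shard in shards]
    for _ in range(steps):
        # Each rank forms its local Gram matrix; summing them stands in for
        # an all-reduce, since X X^T equals the sum of X_i X_i^T over shards.
        local = [matmul(x, transpose(x)) for x in xs]
        gram = [[sum(g[r][c] for g in local) for c in range(len(local[0]))]
                for r in range(len(local[0]))]
        xs = [ns_step(x, gram) for x in xs]  # each rank updates only its shard
    return xs

# Hypothetical 2 x 4 matrix split into two 2 x 2 column shards.
shards = [[[3.0, 1.0], [1.0, 2.0]], [[0.5, -1.0], [-1.0, 0.25]]]
full = duplicated_ns(shards, steps=5)
joined = concat_columns(distributed_ns(shards, steps=5))
max_diff = max(abs(a - b) for ra, rb in zip(full, joined) for a, b in zip(ra, rb))
```

Under these assumptions both modes produce the same result. The duplicated variant communicates once and repeats the compute on every rank, while the distributed variant splits the compute but communicates on every iteration, which is consistent with the guidance above.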

Topics

LLM Training Optimization
Muon Optimizer
NVIDIA Megatron Core
Distributed Optimizers
Newton-Schulz Iteration

Best for: AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.