Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training
Summary
MUD (MomentUm Decorrelation) is a new optimizer designed to accelerate Transformer training. It reports 10-50% improvements in wall-clock time-to-perplexity over tuned AdamW and Muon. MUD replaces Muon's orthogonalized-momentum update, which is based on the polar decomposition, with a triangular (Cholesky-like) whitening surrogate inspired by the Gram-Schmidt and Gauss-Seidel methods. This substantially reduces optimizer overhead: relative to Muon, peak tokens/s improves by approximately 1.3-2.6x across most settings, and by up to nearly 3x for GPT-2 large on A100 GPUs. MUD converges slightly more slowly per step than Muon, but its lower overhead yields shorter overall training times. For example, it matches Muon-level validation perplexity on an ESM-2 150M protein language model in less wall-clock time.
Key takeaway
For AI engineers optimizing large language model training, MUD offers an opportunity to reduce wall-clock training time. If your current setup uses AdamW or Muon, consider evaluating MUD: the reported gain is a 10-50% improvement in wall-clock time to a target perplexity, with the largest throughput gain over Muon (nearly 3x peak tokens/s) reported for GPT-2 large on A100 GPUs. This could translate into lower training costs and quicker iteration cycles for model development and deployment.
Key insights
MUD optimizer accelerates Transformer training by replacing polar decomposition with a Cholesky-like whitening surrogate, reducing overhead.
Principles
- Row-orthonormal matrices are fixed points of the MUD map.
- The inner step is related to symmetric Gauss-Seidel preconditioning.
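The fixed-point property can be checked by hand for exact Cholesky whitening, which is used here only as a stand-in: this summary does not give the precise MUD map, so the map W below and the symbols M, G, and L are assumptions made for illustration.

```latex
% Assumption: M is m x n with m <= n and full row rank, so G is positive definite.
\begin{aligned}
G &= M M^{\top} = L L^{\top}
  && \text{(Cholesky factor, $L$ lower triangular with positive diagonal)} \\
W(M) &= L^{-1} M
  && \text{(triangular whitening)} \\
W(M)\, W(M)^{\top} &= L^{-1} \left( L L^{\top} \right) L^{-\top} = I
  && \text{(the output is row-orthonormal)} \\
M M^{\top} = I \;&\Longrightarrow\; L = I \;\Longrightarrow\; W(M) = M
  && \text{(row-orthonormal inputs are fixed points)}
\end{aligned}
```

The last line relies on the uniqueness of the Cholesky factor with positive diagonal: the only such factor of the identity is the identity itself.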
Method
MUD uses a triangular (Cholesky-like) whitening surrogate, inspired by Gram-Schmidt and Gauss-Seidel, to decorrelate momentum updates, replacing Muon's polar decomposition.
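As a rough illustration of what triangular whitening of a momentum matrix looks like, here is a minimal NumPy sketch of exact Cholesky whitening. It is not the MUD update itself, which this summary does not specify and which is described as a cheaper surrogate. The function name `cholesky_whiten`, the damping parameter `eps`, and the matrix shapes are hypothetical choices for this example.

```python
import numpy as np

def cholesky_whiten(momentum, eps=1e-6):
    """Exact Cholesky whitening (illustrative stand-in, NOT the MUD update).

    Returns L^{-1} M, where L is the lower-triangular Cholesky factor of
    M M^T + eps * I. Assumes M has at least as many columns as rows.
    """
    rows = momentum.shape[0]
    # Gram matrix of the rows; eps is a hypothetical damping term for stability.
    gram = momentum @ momentum.T + eps * np.eye(rows)
    lower = np.linalg.cholesky(gram)
    # General solve for simplicity; a triangular solver would exploit the
    # structure of `lower` and be cheaper.
    return np.linalg.solve(lower, momentum)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 16))   # hypothetical momentum matrix
Q = cholesky_whiten(M)

# Rows of Q are (approximately) orthonormal, i.e. decorrelated.
rows_orthonormal = np.allclose(Q @ Q.T, np.eye(4), atol=1e-4)

# Applying the map again leaves Q (approximately) unchanged: a fixed point.
is_fixed_point = np.allclose(cholesky_whiten(Q), Q, atol=1e-4)
```

Unlike the polar decomposition, which treats rows symmetrically, a triangular factorization processes rows in order, in the same way Gram-Schmidt does. That ordering is what makes the solve triangular and comparatively cheap.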
In practice
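- Achieves a 10-50% improvement in wall-clock time-to-perplexity over tuned AdamW and Muon.
- Improves peak tokens/s by approximately 1.3-2.6x over Muon in most settings.
- Matches Muon-level validation perplexity on ESM-2 150M in less wall-clock time.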
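The throughput gain is larger than the wall-clock gain because MUD needs somewhat more steps than Muon to reach the same perplexity. A minimal sketch of that trade-off follows. It assumes the same number of tokens per step for both optimizers and a sustained (not peak) throughput ratio. The function name and the input values are hypothetical and are not figures from the article.

```python
def wall_clock_speedup(throughput_ratio, step_ratio):
    """Net wall-clock speedup of optimizer A over optimizer B.

    throughput_ratio: tokens/s of A divided by tokens/s of B.
    step_ratio: steps A needs to reach the target, divided by steps B needs.
    Assumes both optimizers process the same number of tokens per step.
    """
    if throughput_ratio <= 0 or step_ratio <= 0:
        raise ValueError("ratios must be positive")
    return throughput_ratio / step_ratio

# Hypothetical inputs: 1.5x sustained throughput, but 25% more steps to target.
speedup = wall_clock_speedup(throughput_ratio=1.5, step_ratio=1.25)
```

With these illustrative inputs, a 1.5x throughput advantage shrinks to a 1.2x wall-clock advantage. A net speedup above 1 requires the throughput ratio to exceed the step ratio.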
Topics
- Transformer Training
- Optimization Algorithms
- Momentum Optimizers
- Whitening Techniques
- Protein Language Models
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.