Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

2026-03-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

MUD (MomentUm Decorrelation) is a new optimizer designed to accelerate Transformer training. It reports 10-50% wall-clock improvements in time-to-perplexity over tuned AdamW and Muon. MUD replaces Muon's polar-decomposition-based orthogonalized-momentum updates with a triangular (Cholesky-like) whitening surrogate inspired by the Gram-Schmidt and Gauss-Seidel methods. This reduces optimizer overhead: compared to Muon, peak tokens/s improves by approximately 1.3-2.6x across most settings, and by nearly 3x for GPT-2 large on A100 GPUs. MUD converges slightly slower per step than Muon, but its lower overhead yields faster overall training. For example, it matches Muon-level validation perplexity on an ESM-2 150M protein language model in less wall-clock time.

Key takeaway

For AI engineers optimizing large language model training, MUD is an opportunity to reduce wall-clock training time. If your current setup uses tuned AdamW or Muon, consider evaluating MUD: the reported gain is a 10-50% reduction in wall-clock time to a target perplexity, and the largest throughput improvement over Muon (nearly 3x peak tokens/s) was observed for GPT-2 large on A100 GPUs. Note that the gain comes from lower per-step overhead, not from faster per-step convergence. This could translate into cost savings and quicker iteration cycles for model development and deployment.

Key insights

The MUD optimizer accelerates Transformer training by replacing Muon's polar decomposition with a Cholesky-like whitening surrogate, which reduces optimizer overhead.

Principles

Row-orthonormal matrices are fixed points of the MUD map.
The inner step is related to symmetric Gauss-Seidel preconditioning.
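
The fixed-point property can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's exact update: cholesky_whiten uses an exact Cholesky factor of the Gram matrix, and tril_surrogate is a hypothetical one-solve variant that row-normalizes and then solves with the lower triangle of the Gram matrix (a solve with that lower triangle is what one forward Gauss-Seidel sweep from a zero initial guess computes). Both function names are made up for this example. For a row-orthonormal input the Gram matrix is the identity, so both maps should return the input unchanged.

```python
import numpy as np

def cholesky_whiten(M):
    """Exact triangular whitening: return L^{-1} M, where M M^T = L L^T."""
    G = M @ M.T                      # Gram matrix of the rows
    L = np.linalg.cholesky(G)        # lower-triangular factor
    return np.linalg.solve(L, M)

def tril_surrogate(M):
    """Assumed one-solve surrogate: row-normalize, then solve with tril(G)."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    G = M @ M.T
    T = np.tril(G)                   # lower triangle, including the diagonal
    return np.linalg.solve(T, M)

rng = np.random.default_rng(0)
# Build a 4x16 matrix with orthonormal rows from a reduced QR factorization.
Q_cols, _ = np.linalg.qr(rng.standard_normal((16, 4)))
Q = Q_cols.T

fixed_exact = np.allclose(cholesky_whiten(Q), Q)
fixed_surrogate = np.allclose(tril_surrogate(Q), Q)
print("fixed point (exact Cholesky):", fixed_exact)
print("fixed point (tril surrogate):", fixed_surrogate)
```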

Method

MUD uses a triangular (Cholesky-like) whitening surrogate, inspired by Gram-Schmidt and Gauss-Seidel, to decorrelate momentum updates, replacing Muon's polar decomposition.
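
To make the contrast concrete, here is a minimal NumPy sketch of the two kinds of update. It rests on several assumptions: the polar factor is computed exactly with an SVD (Muon itself approximates it iteratively with Newton-Schulz steps), the triangular whitening uses an exact Cholesky factorization as a stand-in for MUD's surrogate, and the function names, the eps jitter, and the 8x32 momentum matrix are all hypothetical.

```python
import numpy as np

def polar_factor(M):
    """Orthogonal polar factor U V^T of M, computed exactly via SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def cholesky_whiten(M, eps=1e-8):
    """Triangular whitening: L^{-1} M with (M M^T + eps I) = L L^T.

    eps is an assumed jitter term for numerical safety, not a value
    taken from the article.
    """
    G = M @ M.T + eps * np.eye(M.shape[0])
    L = np.linalg.cholesky(G)
    return np.linalg.solve(L, M)

rng = np.random.default_rng(0)
momentum = rng.standard_normal((8, 32))  # hypothetical momentum matrix

W_polar = polar_factor(momentum)
W_tri = cholesky_whiten(momentum)

# Both outputs have (numerically) decorrelated, unit-norm rows.
gram_polar = W_polar @ W_polar.T
gram_tri = W_tri @ W_tri.T
print("polar rows orthonormal:     ", np.allclose(gram_polar, np.eye(8), atol=1e-8))
print("triangular rows orthonormal:", np.allclose(gram_tri, np.eye(8), atol=1e-6))
```

Both maps produce a row-orthonormal update, but they are different matrices: the polar factor applies a symmetric transform to the momentum, while the triangular version applies a lower-triangular one and is equivalent to Gram-Schmidt on the rows. The appeal of the triangular route is that it needs only a factorization and a triangular solve.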

In practice

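Achieves a 10-50% wall-clock improvement in time-to-perplexity over tuned AdamW and Muon.
Improves peak tokens/s by approximately 1.3-2.6x over Muon in most settings.
Matches Muon-level validation perplexity on ESM-2 150M in less wall-clock time.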
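
The trade-off between slower per-step convergence and higher throughput is simple arithmetic, sketched below. The helper relative_wall_clock and the step ratio of 1.10 are hypothetical and chosen only for illustration; the article does not report a step ratio. The sketch also assumes both optimizers process the same number of tokens per step and sustain a constant throughput, which peak tokens/s figures do not guarantee.

```python
def relative_wall_clock(step_ratio: float, throughput_ratio: float) -> float:
    """Wall-clock time of optimizer B relative to optimizer A.

    step_ratio: steps B needs divided by steps A needs for the same perplexity.
    throughput_ratio: B's tokens/s divided by A's tokens/s.
    Assumes equal tokens per step and constant throughput for both optimizers.
    """
    if step_ratio <= 0 or throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    return step_ratio / throughput_ratio

# Hypothetical illustration only; 1.10 is not a figure from the article.
hypothetical_step_ratio = 1.10
low_end_throughput_ratio = 1.3  # low end of the reported peak tokens/s range

rel = relative_wall_clock(hypothetical_step_ratio, low_end_throughput_ratio)
print(f"relative wall-clock time: {rel:.2f}")
```

Under these assumptions, the faster-per-step optimizer loses on wall-clock time whenever the step ratio is smaller than the throughput ratio, which is the pattern the summary describes.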

Topics

Transformer Training
Optimization Algorithms
Momentum Optimizers
Whitening Techniques
Protein Language Models

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.