Beyond Muon: MUD (MomentUm Decorrelation) for Faster Transformer Training

Source: Machine Learning · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

MUD (MomentUm Decorrelation) is a new optimizer designed to accelerate Transformer training. It reaches a target perplexity in 10-50% less wall-clock time than tuned AdamW and Muon. MUD replaces Muon's orthogonalized-momentum update, which is based on the polar decomposition, with a triangular (Cholesky-like) whitening surrogate inspired by the Gram-Schmidt and Gauss-Seidel methods. This substantially reduces optimizer overhead: compared to Muon, peak tokens/s improves by approximately 1.3-2.6x across most settings, and by nearly 3x for GPT-2 large on A100 GPUs. MUD converges slightly slower per step than Muon, but its lower per-step cost yields faster training overall. For example, on an ESM-2 150M protein language model, MUD matches Muon-level validation perplexity in less wall-clock time.
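The trade-off between slower per-step convergence and higher throughput comes down to simple arithmetic: wall-clock time to a target is the number of tokens needed divided by the sustained tokens/s. The sketch below makes that explicit. The function name and the numbers in the usage lines are hypothetical and are not taken from the source.

```python
def time_to_target_ratio(token_ratio, throughput_ratio):
    """Wall-clock time of optimizer B relative to optimizer A.

    token_ratio:      tokens B needs to reach the target / tokens A needs
    throughput_ratio: sustained tokens/s of B / sustained tokens/s of A

    A result below 1.0 means B reaches the target sooner than A.
    """
    if token_ratio <= 0 or throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    return token_ratio / throughput_ratio


# Hypothetical numbers, for illustration only: B needs 10% more tokens
# than A to reach the target, but sustains 1.3x the throughput.
ratio = time_to_target_ratio(token_ratio=1.10, throughput_ratio=1.30)
print(f"B needs {ratio:.2f}x the wall-clock time of A")
```

Note that the summary quotes peak tokens/s, whereas this calculation assumes sustained end-to-end throughput, so measured ratios from your own runs are the safer input.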

Key takeaway

For AI engineers optimizing large language model training, MUD offers an opportunity to reduce wall-clock training time. If your current setup uses AdamW or Muon, consider evaluating MUD: it is reported to reach a target perplexity in 10-50% less wall-clock time, with the gain coming from lower optimizer overhead rather than from faster per-step convergence. The largest reported throughput gain, nearly 3x over Muon, is for GPT-2 large on A100 GPUs. Faster training of this kind can translate into lower compute costs and quicker iteration cycles for model development and deployment.

Key insights

The MUD optimizer accelerates Transformer training by replacing Muon's polar decomposition with a triangular (Cholesky-like) whitening surrogate, which reduces optimizer overhead.

Principles

Method

MUD uses a triangular (Cholesky-like) whitening surrogate, inspired by Gram-Schmidt and Gauss-Seidel, to decorrelate momentum updates, replacing Muon's polar decomposition.
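This summary does not spell out MUD's exact update rule, so the following is only an illustrative sketch of generic Cholesky-based whitening applied to a momentum matrix. It is not the authors' algorithm. The function names, the eps damping term, and the example matrix are all assumptions. The idea shown: factor the row Gram matrix as L L^T, then solve L W = M by forward substitution, which decorrelates the rows of M in a single triangular sweep, much like Gram-Schmidt.

```python
import math


def gram(M):
    """Row Gram matrix M M^T for a matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in M] for ri in M]


def cholesky(G, eps=1e-8):
    """Lower-triangular L with L L^T = G + eps * I (eps is an assumed damping)."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s + eps)
            else:
                L[i][j] = s / L[j][j]
    return L


def triangular_whiten(M, eps=1e-8):
    """Return W = L^{-1} M, whose rows are approximately orthonormal."""
    L = cholesky(gram(M), eps)
    W = []
    for i, row in enumerate(M):
        r = list(row)
        # Forward substitution: remove the parts of this row that lie
        # along the rows already whitened, then rescale.
        for k in range(i):
            r = [x - L[i][k] * w for x, w in zip(r, W[k])]
        W.append([x / L[i][i] for x in r])
    return W


# Hypothetical 3x4 "momentum" matrix, for illustration only.
momentum = [
    [3.0, 1.0, 0.0, 2.0],
    [1.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
]
whitened = triangular_whiten(momentum)
for row in gram(whitened):
    print([round(x, 6) for x in row])  # rows of W W^T, close to the identity
```

Unlike the polar factor, the result depends on the order of the rows, because each row is only decorrelated against the rows before it. A practical implementation would use batched GPU routines for the Cholesky factorization and the triangular solve instead of Python loops.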

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.