LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
Summary
LoRA-Muon is a new optimizer derived by applying the Muon optimizer's spectral steepest-descent rule to the low-rank adaptation (LoRA) setting. It targets common LoRA finetuning problems: sensitivity to initialization, poor learning-rate transfer across ranks, and difficulty outperforming dense baselines. The core claim is that LoRA-Muon serves as an effective low-rank proxy for full-rank Muon and Shampoo-family optimizers, with optimal learning rates that transfer consistently across rank, width, depth, and factor rescaling. In a compute-matched TinyShakespeare study, a rank-2 LoRA-Muon proxy recovered the dense baseline's best learning rate, and a rank-32 run achieved lower mean validation loss than the dense baseline. LoRA-Muon also avoids the dependence on arbitrary factor scaling seen in Spectron and computes spectral updates without QR decomposition, which makes it more accelerator-friendly and memory-efficient than alternatives such as LoRA-RITE. This work was published on 2026-06-11.
Key takeaway
For machine learning engineers finetuning large models with LoRA, LoRA-Muon is worth evaluating as a replacement for factor-wise optimizers. The reported benefits are more stable training, optimal learning rates that transfer across rank, width, depth, and factor rescaling, and, in the compute-matched TinyShakespeare study, a rank-32 run with lower mean validation loss than the dense baseline. Because it avoids QR decomposition and second-moment storage, it is also described as accelerator-friendly and memory-efficient.
Key insights
LoRA-Muon applies spectral steepest descent to LoRA, improving finetuning stability and learning-rate transfer; in the reported TinyShakespeare study, a rank-32 run reached lower mean validation loss than the dense baseline.
Principles
- Spectral steepest descent enhances LoRA stability.
- Learning rates can transfer across ranks.
- Factor-wise optimizers need careful tuning.
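The factor-rescaling issue behind the last principle can be illustrated with plain gradient descent on the two LoRA factors. The sketch below is an assumption-laden toy, not LoRA-Muon or Spectron: the shapes, the scale `c`, the learning rate, and the random matrix standing in for the loss gradient are all hypothetical. It shows that rescaling the factors (B to cB, A to A/c) leaves the adapter B·A unchanged but changes the update that a factor-wise step produces.

```python
import numpy as np

def factor_sgd_step(b, a, grad_w, lr):
    """Take one plain gradient step on each LoRA factor and
    return the resulting change in the adapter product B @ A."""
    grad_b = grad_w @ a.T   # dL/dB via the chain rule
    grad_a = b.T @ grad_w   # dL/dA via the chain rule
    new_b = b - lr * grad_b
    new_a = a - lr * grad_a
    return new_b @ new_a - b @ a

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2                       # hypothetical layer shape and rank
b = rng.standard_normal((m, r))
a = rng.standard_normal((r, n))
grad_w = rng.standard_normal((m, n))    # stand-in for the gradient w.r.t. the adapted weight

c = 10.0                                # arbitrary rescaling of the factors
b_scaled, a_scaled = c * b, a / c

# The adapter itself is identical under the rescaling ...
same_adapter = bool(np.allclose(b @ a, b_scaled @ a_scaled))

# ... but the factor-wise update to the adapter is not.
delta = factor_sgd_step(b, a, grad_w, lr=1e-2)
delta_scaled = factor_sgd_step(b_scaled, a_scaled, grad_w, lr=1e-2)
relative_gap = float(np.linalg.norm(delta - delta_scaled) / np.linalg.norm(delta))

print("same adapter:", same_adapter)
print("relative gap between updates:", relative_gap)
```

To first order the update is -lr (G AᵀA + B BᵀG), so the rescaling weights the two terms by 1/c² and c² respectively. That is why the best learning rate for a factor-wise method can shift with how the factors happen to be scaled.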
Method
LoRA-Muon applies Muon's spectral steepest-descent rule to the low-rank setting and adds a split weight-decay rule. It computes spectral updates directly, avoiding both QR decomposition and second-moment storage.
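For orientation, the dense spectral steepest-descent direction can be sketched as follows. This is not the LoRA-Muon algorithm, whose low-rank update and split weight-decay rule are not spelled out in this summary. It only shows the dense idea: for a gradient G = U S Vᵀ, the steepest-descent direction under the spectral norm is U Vᵀ, and it can be approximated with matrix multiplications alone. The function names are hypothetical, and the iteration is the classical cubic Newton-Schulz scheme rather than Muon's tuned polynomial.

```python
import numpy as np

def spectral_direction_svd(grad):
    """Exact orthogonal polar factor U @ V^T of the gradient."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def spectral_direction_newton_schulz(grad, steps=40):
    """Approximate U @ V^T using only matrix multiplications.

    Classical cubic Newton-Schulz iteration; assumes grad has
    full rank so that every singular value is driven to 1.
    """
    # Frobenius normalization puts all singular values in (0, 1].
    x = grad / (np.linalg.norm(grad) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))      # hypothetical dense gradient

exact = spectral_direction_svd(grad)
approx = spectral_direction_newton_schulz(grad)

max_gap = float(np.max(np.abs(exact - approx)))
top_singular_value = float(np.linalg.norm(approx, 2))

print("max elementwise gap:", max_gap)
print("largest singular value of the update:", top_singular_value)
```

Every singular value of the resulting direction is (approximately) 1, so the step size is governed by the learning rate rather than by the gradient's scale. No QR factorization or second-moment buffer appears in this dense sketch either.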
In practice
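- Use LoRA-Muon when LoRA finetuning is unstable or sensitive to initialization.
- Reuse a learning rate tuned at one rank, width, or depth at other settings, since the optimal rate is reported to transfer.
- Use it where memory is tight, since it stores no second moments and needs no QR decomposition.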
Topics
- LoRA-Muon
- Low-Rank Adaptation
- Deep Learning Optimizers
- Spectral Descent
- Model Finetuning
- Memory Efficiency
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.