LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
Summary
LoRA-Muon is an optimizer designed to address the tuning difficulties that arise when Low-Rank Adaptation (LoRA) is trained with standard optimizers such as AdamW. It is derived by applying the Muon optimizer's spectral steepest-descent rule to the low-rank manifold. It introduces a split weight-decay rule and is designed so that optimal learning rates transfer across rank, width, depth, and factor rescaling. In compute-matched TinyShakespeare experiments, a rank-2 LoRA-Muon proxy recovered the dense best tested learning rate of 0.1, and a rank-32 run achieved a mean validation loss of 1.776 ± 0.002, lower than the dense baseline's 1.789 ± 0.002. The research also highlights Spectron's sensitivity to arbitrary factor scaling, in contrast with LoRA-Muon's gauge invariance, and clarifies that LoRA-RITE's simplified QR-coordinate core implements the same spectral update without QR factorizations or second moments.
Key takeaway
For machine learning engineers fine-tuning large language models with LoRA, LoRA-Muon can simplify hyperparameter tuning. Because its optimal learning rates transfer across rank, width, and depth, you can search for a learning rate on a small, compute-matched LoRA proxy and reuse it when scaling to higher-rank, full-rank, or larger models. This reduces experimental cost and makes tuning more reliable, especially when LoRA factor initializations differ in scale.
Key insights
LoRA-Muon enables robust, transferable learning rates for low-rank adaptation by applying spectral steepest descent on the low-rank manifold.
Principles
- Optimizer geometry should align with the low-rank matrix manifold itself.
- Gauge invariance ensures consistent updates despite arbitrary factor reparameterization.
- Optimal learning rates can transfer across LoRA ranks, widths, and depths.
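The gauge-invariance principle can be made concrete with a small NumPy sketch. It does not implement LoRA-Muon. It only shows the problem being addressed: rescaling the factors as B·c and A/c leaves the product W = B A unchanged, yet a plain gradient step on the factors moves W differently. The shapes, the random gradient `G`, the scale `c`, and the helper `factor_sgd_step` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2                      # illustrative shapes (assumed)
B = rng.standard_normal((m, r))        # LoRA factor B
A = rng.standard_normal((r, n))        # LoRA factor A
G = rng.standard_normal((m, n))        # stand-in for dL/dW at W = B @ A

def factor_sgd_step(B, A, G, lr):
    """Plain gradient step on the factors; returns the new product."""
    # Chain rule: dL/dB = G A^T and dL/dA = B^T G.
    B_new = B - lr * G @ A.T
    A_new = A - lr * B.T @ G
    return B_new @ A_new

c = 10.0
B_s, A_s = B * c, A / c                # same product, different gauge

W0 = B @ A
same_product = np.allclose(W0, B_s @ A_s)

# Change in W produced by one step, for each parameterization.
dW_balanced = factor_sgd_step(B, A, G, 1e-2) - W0
dW_rescaled = factor_sgd_step(B_s, A_s, G, 1e-2) - W0
gauge_gap = np.linalg.norm(dW_balanced - dW_rescaled)

print("same W before the step:", same_product)
print("difference between the two updates to W:", gauge_gap)
```

A gauge-invariant optimizer would make the two updates to W coincide, so the reported difference would be zero up to rounding.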
Method
LoRA-Muon derives factor updates by solving decoupled subproblems in the tangent space of the low-rank manifold, specializing to the spectral norm, and uses a split weight-decay rule.
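For background on what "specializing to the spectral norm" means, the sketch below computes the spectral steepest-descent direction for a dense matrix, which is the rule Muon is built on. It is not the LoRA-Muon factor update, whose exact form is not given in this summary. For a gradient G = U S Vᵀ, the unit-spectral-norm direction best aligned with G is U Vᵀ. The SVD is used here for clarity, whereas Muon in practice approximates this with a Newton-Schulz iteration applied to a momentum buffer. The function name `spectral_direction` and the shapes are assumptions.

```python
import numpy as np

def spectral_direction(G):
    """Return U @ V^T for G = U S V^T (the polar factor of G).

    Among all matrices D with spectral norm at most 1, this D
    maximizes the inner product <G, D>.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))        # stand-in gradient (assumed shape)
D = spectral_direction(G)

# Every singular value of the direction equals 1 (G is full rank here).
sv = np.linalg.svd(D, compute_uv=False)

# The alignment <G, D> equals the nuclear norm of G (sum of singular values).
alignment = np.sum(G * D)
nuclear = np.linalg.svd(G, compute_uv=False).sum()

print("singular values of D:", sv)
print("alignment:", alignment, "nuclear norm:", nuclear)
```

A dense descent step would then be W ← W − lr · D. Because every singular value of D is 1, the step size in spectral norm is set by the learning rate alone, not by the gradient's scale.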
In practice
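- Use a small LoRA-Muon proxy to find learning rates for dense Muon: in the TinyShakespeare experiments, a rank-2 proxy recovered the dense best tested learning rate of 0.1.
- Expect competitive loss at moderate rank: in the same experiments, rank-32 LoRA-Muon reached a mean validation loss of 1.776 ± 0.002 versus 1.789 ± 0.002 for the dense baseline.
- Avoid Spectron when fine-tuning with badly imbalanced LoRA factors, since it is sensitive to arbitrary factor scaling.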
Topics
- LoRA
- Low-Rank Adaptation
- Muon Optimizer
- Spectral Steepest Descent
- Hyperparameter Tuning
- Gauge Invariance
- TinyShakespeare
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.