LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

2025-02-13 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

LoRA-Muon is an optimizer for Low-Rank Adaptation (LoRA) that targets the tuning difficulties LoRA shows under standard optimizers such as AdamW. It is derived by applying the Muon optimizer's spectral steepest-descent rule to the low-rank manifold. The derivation yields a split weight-decay rule and optimal learning rates that transfer across rank, width, depth, and factor rescaling. In compute-matched TinyShakespeare experiments, a rank-2 LoRA-Muon proxy recovered the best learning rate tested for the dense model, 0.1, and a rank-32 run achieved a lower mean validation loss than the dense baseline (1.776 ± 0.002 vs. 1.789 ± 0.002). The research also highlights Spectron's sensitivity to arbitrary factor scaling, contrasting with LoRA-Muon's gauge invariance, and clarifies that LoRA-RITE's simplified QR-coordinate core implements the same spectral update without QR factorizations or second moments.

Key takeaway

For machine learning engineers tuning large language models with LoRA, the main practical benefit of LoRA-Muon is cheaper hyperparameter search. Because its optimal learning rate transfers across rank, width, and depth, you can sweep learning rates on a small, compute-matched LoRA proxy and reuse the result when scaling to full-rank or larger models. This cuts experimental cost, and because the update is invariant to factor rescaling, the tuned learning rate does not depend on how the LoRA factors happen to be initialized or scaled.

Key insights

LoRA-Muon enables robust, transferable learning rates for low-rank adaptation by applying spectral steepest descent on the low-rank manifold.

Principles

Optimizer geometry should align with the low-rank matrix manifold itself.
Gauge invariance ensures consistent updates despite arbitrary factor reparameterization.
Optimal learning rates can transfer across LoRA ranks, widths, and depths.
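
The gauge-invariance principle can be made concrete with a small NumPy sketch. It shows that rescaling the factors leaves the product W = BA unchanged, while the change in W induced by a plain gradient step on B does depend on the scaling. The shapes, the random stand-in gradient G, and the helper names regauge and sgd_induced_update_from_B are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2                      # hypothetical layer shape and LoRA rank
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
G = rng.standard_normal((m, n))        # stand-in for dL/dW at W = B @ A


def regauge(B, A, S):
    """Return (B S^-1, S A): different factors, same product B A."""
    return B @ np.linalg.inv(S), S @ A


def sgd_induced_update_from_B(G, A, lr):
    """Change in W = B A caused by a plain gradient step on B alone."""
    grad_B = G @ A.T                   # chain rule: dL/dB = (dL/dW) A^T
    return (-lr * grad_B) @ A


c = 10.0
B2, A2 = regauge(B, A, c * np.eye(r))  # scale A up by c, B down by c

same_product = np.allclose(B @ A, B2 @ A2)
dW1 = sgd_induced_update_from_B(G, A, lr=0.1)
dW2 = sgd_induced_update_from_B(G, A2, lr=0.1)
ratio = np.linalg.norm(dW2) / np.linalg.norm(dW1)
print(same_product, ratio)
```

Here the two factorizations represent the same weight matrix, yet the plain gradient step moves W by a factor of c squared more in the rescaled gauge. A gauge-invariant optimizer is one whose induced change in W is the same for both.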

Method

LoRA-Muon derives factor updates by solving decoupled subproblems in the tangent space of the low-rank manifold, specializing to the spectral norm, and uses a split weight-decay rule.
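
As a rough illustration of this description, the NumPy sketch below works through one decoupled subproblem: with A held fixed, find the update to W of the form dB @ A that gives the steepest descent under a spectral-norm budget. This is a sketch under stated assumptions, not the paper's algorithm. It omits momentum, the split weight-decay rule, and the matching A-factor subproblem, and it uses an SVD for clarity, whereas practical Muon implementations typically approximate the orthogonal factor with a Newton-Schulz iteration. The function names msign, dense_spectral_step, and b_subproblem_step are hypothetical.

```python
import numpy as np


def msign(X):
    """Orthogonal polar factor U V^T of X, computed from a reduced SVD."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt


def dense_spectral_step(G, lr):
    """Dense case: argmin <G, D> subject to ||D||_op <= lr."""
    return -lr * msign(G)


def b_subproblem_step(G, A, lr):
    """Sketch: best dW of the form dB @ A with ||dW||_op <= lr, A held fixed."""
    # Q has orthonormal rows spanning the row space of A (assumes A has full row rank).
    _, _, Q = np.linalg.svd(A, full_matrices=False)
    # Writing dW = X @ Q gives ||dW||_op = ||X||_op and <G, dW> = <G Q^T, X>,
    # so the problem reduces to a dense spectral step on G Q^T.
    dW = -lr * msign(G @ Q.T) @ Q
    # Recover the factor update; exact because the rows of dW lie in A's row space.
    dB = dW @ np.linalg.pinv(A)
    return dB, dW


rng = np.random.default_rng(0)
m, n, r, lr = 6, 5, 2, 0.1             # hypothetical shapes and step size
A = rng.standard_normal((r, n))
G = rng.standard_normal((m, n))        # stand-in for dL/dW

dB, dW = b_subproblem_step(G, A, lr)

# Re-gauge A with an arbitrary invertible matrix S; B would absorb S^-1.
S = np.array([[3.0, 1.0], [0.0, 0.5]])
_, dW_regauged = b_subproblem_step(G, S @ A, lr)

print(np.linalg.norm(dense_spectral_step(G, lr), 2))  # spectral norm of the dense step
print(np.linalg.norm(dW, 2))                          # spectral norm of the low-rank step
print(np.allclose(dW, dW_regauged))                   # gauge-invariance check
```

In this sketch the induced change in W depends on A only through its row space, so any invertible rescaling of the factors produces the same step in W. That is the property the summary calls gauge invariance.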

In practice

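Sweep learning rates on a low-rank LoRA-Muon proxy to find the setting for dense Muon; in the TinyShakespeare runs, a rank-2 proxy recovered the dense best tested value of 0.1.
Expect rank-32 LoRA-Muon to be competitive with dense training: in the compute-matched TinyShakespeare runs it reached a lower mean validation loss than the dense baseline (1.776 ± 0.002 vs. 1.789 ± 0.002).
Avoid Spectron when finetuning with badly imbalanced LoRA factors, since it is sensitive to arbitrary factor scaling.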

Topics

LoRA
Low-Rank Adaptation
Muon Optimizer
Spectral Steepest Descent
Hyperparameter Tuning
Gauge Invariance
TinyShakespeare

Code references

karpathy/char-rnn

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.