LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

2026-06-11 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

LoRA-Muon is an optimizer derived by applying Muon's spectral steepest-descent rule to the low-rank adaptation (LoRA) setting. It targets three common LoRA finetuning problems: sensitivity to initialization, poor learning rate transfer across ranks, and difficulty outperforming dense baselines. The core claim is that LoRA-Muon serves as an effective low-rank proxy for full-rank Muon and Shampoo-family optimizers, with optimal learning rates that transfer consistently across rank, width, depth, and factor rescaling. In a compute-matched TinyShakespeare study, a rank-2 LoRA-Muon proxy recovered the dense baseline's best learning rate, and a rank-32 run achieved lower mean validation loss than the dense baseline. LoRA-Muon also avoids the dependence on arbitrary factor scaling seen in Spectron, and it computes spectral updates without QR decomposition, which makes it more accelerator-friendly and memory-efficient than alternatives such as LoRA-RITE. This work was published on 2026-06-11.

Key takeaway

For machine learning engineers finetuning large models with LoRA, LoRA-Muon is worth evaluating as a replacement for existing LoRA optimizers. The reported benefits are more stable training, learning rates that transfer across rank, width, depth, and factor rescaling, and, in the compute-matched TinyShakespeare study, a rank-32 run with lower mean validation loss than the dense baseline. It also computes spectral updates without QR decomposition and without second-moment storage, which is reported to make it more accelerator-friendly and memory-efficient than alternatives such as LoRA-RITE.

Key insights

LoRA-Muon applies spectral steepest descent to LoRA, improving finetuning stability and learning rate transfer, and in the reported TinyShakespeare study reaching lower validation loss than the dense baseline.

Principles

Spectral steepest descent enhances LoRA stability.
Learning rates can transfer across ranks.
Factor-wise optimizers need careful tuning.
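
The factor-scaling issue behind these principles can be illustrated with a small NumPy sketch. It is not LoRA-Muon or Spectron. It only assumes plain gradient descent applied separately to the two LoRA factors, and shows that rescaling the factors (B to cB, A to A/c) leaves the product BA unchanged while changing the effective weight update. The names `factor_gd_step`, `B`, `A`, `G`, and `c` are hypothetical and chosen for illustration.

```python
import numpy as np

def factor_gd_step(B, A, G, lr):
    """One plain gradient step on each LoRA factor.

    G is the gradient of the loss with respect to the product W = B @ A,
    so by the chain rule dL/dB = G @ A.T and dL/dA = B.T @ G.
    """
    B_new = B - lr * (G @ A.T)
    A_new = A - lr * (B.T @ G)
    return B_new, A_new

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 5, 2
B = rng.standard_normal((d_out, rank))
A = rng.standard_normal((rank, d_in))
G = rng.standard_normal((d_out, d_in))  # stand-in gradient, not from a real model

c = 10.0  # arbitrary rescaling of the factors
same_start = np.allclose(B @ A, (c * B) @ (A / c))

B1, A1 = factor_gd_step(B, A, G, lr=1e-2)
B2, A2 = factor_gd_step(c * B, A / c, G, lr=1e-2)

# Effective change in the product W = B @ A for each parameterization.
delta_plain = B1 @ A1 - B @ A
delta_rescaled = B2 @ A2 - B @ A

print("same starting product:", same_start)
print("update norms:", np.linalg.norm(delta_plain), np.linalg.norm(delta_rescaled))
```

Both runs start from the same adapted weights, yet the two updates differ, which is why a learning rate tuned for one factor scaling need not suit another.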

Method

LoRA-Muon applies Muon's spectral steepest-descent rule to low-rank settings, incorporating a split weight-decay rule. It computes spectral updates directly, avoiding QR decomposition and second-moment storage.
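
As background for the method, the sketch below shows the dense (full-rank) spectral steepest-descent direction that Muon is built on: the gradient's singular values are all set to 1, giving U V^T from the SVD G = U S V^T. This is not the LoRA-Muon update itself, whose low-rank form and split weight-decay rule are not spelled out in this summary. The function names are hypothetical, and the cubic Newton-Schulz iteration is a simpler stand-in for the tuned polynomial Muon uses in practice; it is included to show how the direction can be approximated with matrix multiplications only.

```python
import numpy as np

def spectral_direction(G):
    """Exact U @ V^T from the thin SVD of G (nonzero singular values set to 1)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def newton_schulz_direction(G, steps=30):
    """Approximate U @ V^T using only matrix multiplications.

    Cubic Newton-Schulz iteration: each singular value s is mapped to
    1.5*s - 0.5*s**3, which converges to 1 for s in (0, sqrt(3)).
    """
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius scaling puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))  # stand-in gradient matrix

D_svd = spectral_direction(G)
D_ns = newton_schulz_direction(G)

print("singular values of the direction:", np.linalg.svd(D_svd, compute_uv=False))
print("max gap, Newton-Schulz vs SVD:", np.abs(D_ns - D_svd).max())
```

The inner product of G with this direction equals the sum of G's singular values, which is what makes it the steepest-descent direction under a spectral-norm constraint.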

In practice

Use LoRA-Muon for stable LoRA finetuning.
Expect the best learning rate to carry over across ranks, widths, and depths.
Use it when optimizer memory is constrained.
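
To make the memory point concrete, here is a back-of-the-envelope count for a single square layer. The layer size and rank are illustrative, not taken from the source. The state comparison assumes one momentum buffer per trainable parameter versus a two-buffer Adam-style optimizer; the source states only that LoRA-Muon avoids second-moment storage.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def dense_param_count(d_out, d_in):
    """Trainable parameters when the full weight matrix is finetuned."""
    return d_out * d_in

d_out, d_in, rank = 4096, 4096, 32  # illustrative sizes

lora_params = lora_param_count(d_out, d_in, rank)
dense_params = dense_param_count(d_out, d_in)

# Assumed optimizer state sizes, in number of stored values.
momentum_only_state = 1 * lora_params  # single momentum buffer (assumption)
two_moment_state = 2 * lora_params     # first and second moments, Adam-style

print("LoRA params:", lora_params)      # 262144
print("dense params:", dense_params)    # 16777216
print("dense / LoRA ratio:", dense_params // lora_params)  # 64
print("state values:", momentum_only_state, "vs", two_moment_state)
```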

Topics

LoRA-Muon
Low-Rank Adaptation
Deep Learning Optimizers
Spectral Descent
Model Finetuning
Memory Efficiency

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.