LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

LoRA-Muon is an optimizer derived by applying the Muon optimizer's spectral steepest-descent rule to the low-rank adaptation (LoRA) setting. It targets three common LoRA finetuning problems: sensitivity to initialization, poor learning-rate transfer across ranks, and difficulty outperforming dense baselines. The core claim is that LoRA-Muon is an effective low-rank proxy for full-rank Muon and Shampoo-family optimizers, with optimal learning rates that transfer consistently across rank, width, depth, and factor rescaling. In a compute-matched TinyShakespeare study, a rank-2 LoRA-Muon proxy recovered the dense baseline's best learning rate, and a rank-32 run achieved lower mean validation loss than the dense baseline. LoRA-Muon also avoids the dependence on arbitrary factor scaling seen in Spectron, and it computes spectral updates without QR decomposition, which makes it more accelerator-friendly and memory-efficient than alternatives such as LoRA-RITE. This work was published on 2026-06-11.
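
The factor-rescaling issue mentioned above can be illustrated with a small sketch. It assumes the usual LoRA parameterization W = B A and uses plain SGD on each factor, not LoRA-Muon itself (the summary does not give the update formula). All names (`lora_sgd_delta`, `G`, `c`) are hypothetical. Rescaling the factors to (cB, A/c) leaves the product unchanged, yet the SGD step on the product changes, which is the kind of arbitrary scaling dependence a rescaling-invariant optimizer is meant to remove.

```python
import numpy as np

def lora_sgd_delta(B, A, G, lr=0.1):
    """Change in the product B @ A after one plain SGD step on each factor.

    G stands in for dL/dW, the gradient with respect to the product W = B @ A.
    """
    grad_B = G @ A.T          # chain rule: dL/dB = G A^T
    grad_A = B.T @ G          # chain rule: dL/dA = B^T G
    B_new = B - lr * grad_B
    A_new = A - lr * grad_A
    return B_new @ A_new - B @ A

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2                    # illustrative sizes only
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))
G = rng.standard_normal((d_out, d_in))      # synthetic gradient, not real data

c = 10.0
B_scaled, A_scaled = c * B, A / c           # same product, different factors

same_product = np.allclose(B @ A, B_scaled @ A_scaled)
same_update = np.allclose(lora_sgd_delta(B, A, G),
                          lora_sgd_delta(B_scaled, A_scaled, G))
print("product unchanged by rescaling:", same_product)
print("SGD update unchanged by rescaling:", same_update)
```

To first order the step on the product is -lr (G AᵀA + B Bᵀ G); after rescaling the two terms are weighted by 1/c² and c² respectively, so the effective update depends on a choice that does not affect the model's function.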

Key takeaway

For machine learning engineers finetuning models with LoRA, LoRA-Muon is worth evaluating as a replacement for existing LoRA optimizers. The reported benefits are lower sensitivity to initialization, optimal learning rates that transfer across rank, width, depth, and factor rescaling, and, in the TinyShakespeare study, a rank-32 run with lower mean validation loss than the dense baseline. Because it needs no QR decomposition and stores no second moments, it is also lighter on memory and better suited to accelerators than alternatives such as LoRA-RITE.

Key insights

LoRA-Muon applies spectral steepest descent to LoRA, aiming for more stable finetuning, learning rates that transfer across configurations, and validation loss that matches or beats dense baselines.

Principles

Method

LoRA-Muon applies Muon's spectral steepest-descent rule to the low-rank setting and adds a split weight-decay rule. It computes spectral updates directly, with no QR decomposition and no second-moment storage.
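
For context, the spectral step in standard (full-rank) Muon is usually computed with a Newton-Schulz iteration, which approximately orthogonalizes a matrix using only matrix multiplications. The sketch below shows that iteration with the coefficients used in the public Muon reference implementation, applied to a synthetic matrix shaped like a LoRA factor gradient. Whether LoRA-Muon uses this exact routine is an assumption; the summary only states that no QR decomposition is needed. The function name and shapes are illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor U V^T of M = U S V^T.

    Uses the quintic Newton-Schulz iteration from standard Muon. Each step
    maps every singular value s to a*s + b*s**3 + c*s**5, pushing it toward
    a band around 1 while leaving the singular vectors untouched.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)     # Frobenius norm bounds spectral norm by 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # keep the Gram matrix X @ X.T small
        X = X.T
    for _ in range(steps):
        gram = X @ X.T
        X = a * X + (b * gram + c * (gram @ gram)) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 4))       # synthetic stand-in for a factor gradient
update = newton_schulz_orthogonalize(grad)

singular_values = np.linalg.svd(update, compute_uv=False)
U, _, Vt = np.linalg.svd(grad, full_matrices=False)
polar_error = np.linalg.norm(update - U @ Vt, 2)   # spectral-norm distance to exact U V^T
print("singular values of update:", singular_values)
print("distance to exact polar factor:", polar_error)
```

The iteration is deliberately approximate: the singular values land near 1 rather than exactly on it, which is acceptable for an optimizer step and keeps the whole computation in matmuls that run well on accelerators.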

Topics

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.