Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

2026-02-27 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Natural Language Processing · Depth: Expert, quick

Summary

LoRA-Pre is a novel low-rank optimizer designed to reduce the memory footprint of modern optimizers such as Adam and Muon, whose first- and second-order momenta typically add significant memory overhead. It reframes the exponential moving average (EMA) as online gradient flow training of a linear regressor, then decomposes the full momentum matrix into a compact low-rank subspace. This maintains optimization performance while improving memory efficiency. The authors validated the method by pre-training Llama-architecture models ranging from 60M to 1B parameters, where LoRA-Pre achieved the highest performance and demonstrated superior rank efficiency, using only 1/8 the rank of baseline methods. In fine-tuning, LoRA-Pre consistently outperformed standard LoRA, with improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B.

Key takeaway

For AI engineers and research scientists working with large language models, LoRA-Pre reduces optimizer memory during both pre-training and fine-tuning. Teams may be able to reach comparable or better performance with substantially less memory overhead, which could allow training larger models or using existing hardware more efficiently. Consider integrating LoRA-Pre, available at https://github.com/mrflogs/LoRA-Pre, into your LLM development workflows.

Key insights

LoRA-Pre reduces optimizer memory overhead by applying low-rank approximation to momentum states in large model training.
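
To give a rough sense of scale, the sketch below counts the numbers stored for one momentum state of a single weight matrix, first in full and then as two rank-r factors. The function names, the 4096 x 4096 shape, and the rank of 128 are hypothetical choices for illustration and are not taken from the paper.

```python
def full_momentum_floats(rows: int, cols: int) -> int:
    """Entries stored when the momentum has the same shape as the weight."""
    return rows * cols


def low_rank_momentum_floats(rows: int, cols: int, rank: int) -> int:
    """Entries stored for two factors of shape (rows, rank) and (rank, cols)."""
    return rank * (rows + cols)


# Hypothetical layer shape and rank, chosen only to illustrate the arithmetic.
rows, cols, rank = 4096, 4096, 128
full = full_momentum_floats(rows, cols)
low = low_rank_momentum_floats(rows, cols, rank)
print(full, low, full / low)  # 16777216 1048576 16.0
```

Under these assumed numbers, the factored state is 16 times smaller per momentum matrix. Actual savings depend on the layer shapes and ranks used.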

Principles

EMA can be modeled as online linear regression.
Low-rank decomposition improves memory efficiency.
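
The first principle can be checked numerically. A minimal sketch, assuming the regression loss is the squared distance between the momentum m and the current gradient g: one gradient step on that loss with step size 1 - beta gives exactly the EMA update. The function names and the value of beta are illustrative, not taken from the paper.

```python
import numpy as np


def ema_update(m: np.ndarray, g: np.ndarray, beta: float) -> np.ndarray:
    """Standard exponential moving average of the gradient."""
    return beta * m + (1.0 - beta) * g


def gradient_step_update(m: np.ndarray, g: np.ndarray, beta: float) -> np.ndarray:
    """One gradient step on loss(m) = 0.5 * ||m - g||_F^2 with step size 1 - beta."""
    grad = m - g  # gradient of the assumed loss with respect to m
    return m - (1.0 - beta) * grad


rng = np.random.default_rng(0)
m = rng.standard_normal((4, 3))
g = rng.standard_normal((4, 3))
beta = 0.9  # illustrative value

same = np.allclose(ema_update(m, g, beta), gradient_step_update(m, g, beta))
print(same)  # True
```

The two updates agree because m - (1 - beta)(m - g) simplifies to beta * m + (1 - beta) * g.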

Method

LoRA-Pre first reframes the EMA momentum update as online gradient flow training of a linear regressor. Within that online linear learner, it then decomposes the full momentum matrix into a compact low-rank subspace.
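
One way to picture this is to store the momentum as a product of two thin factors and update the factors by a gradient step on the same squared-error regression loss. The sketch below is an illustrative reading only and is not the authors' update rule: the names B, A, and low_rank_momentum_step, the step size, and the toy shapes are all assumptions. For the paper's actual algorithm, see mrflogs/LoRA-Pre.

```python
import numpy as np


def regression_loss(B: np.ndarray, A: np.ndarray, g: np.ndarray) -> float:
    """Squared error between the low-rank momentum B @ A and the gradient g."""
    return 0.5 * float(np.sum((B @ A - g) ** 2))


def low_rank_momentum_step(B, A, g, lr):
    """One simultaneous gradient step on both factors (assumed update, for illustration)."""
    residual = B @ A - g
    grad_B = residual @ A.T  # derivative of the loss with respect to B
    grad_A = B.T @ residual  # derivative of the loss with respect to A
    return B - lr * grad_B, A - lr * grad_A


rng = np.random.default_rng(0)
rows, cols, rank = 8, 6, 2  # toy sizes, hypothetical

# A fixed rank-2 "gradient", normalized to unit Frobenius norm for a stable step size.
g = rng.standard_normal((rows, rank)) @ rng.standard_normal((rank, cols))
g /= np.linalg.norm(g)

B = 0.1 * rng.standard_normal((rows, rank))
A = 0.1 * rng.standard_normal((rank, cols))

initial_loss = regression_loss(B, A, g)
for _ in range(500):
    B, A = low_rank_momentum_step(B, A, g, lr=0.1)
final_loss = regression_loss(B, A, g)
print(initial_loss, final_loss)
```

In real training the gradient changes at every step, so the factors would track a moving target rather than converge to a fixed matrix. The fixed g here only shows that the factored state can follow the regression objective while storing rank * (rows + cols) numbers instead of rows * cols.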

In practice

Pre-train Llama models from 60M to 1B parameters.
Fine-tune Llama-3.1-8B and Llama-2-7B models.

Topics

Low-Rank Optimizers
Large Language Models
Memory Efficiency
Pre-training
Fine-tuning

Code references

mrflogs/LoRA-Pre

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.