Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Summary
LoRA-Pre is a low-rank optimizer designed to reduce the memory footprint of modern optimizers such as Adam and Muon, whose first- and second-order momenta add significant memory overhead. It reframes the exponential moving average (EMA) used in momentum as the training of a linear regressor via online gradient flow, then decomposes the full momentum matrix into a compact low-rank subspace within that learner. The aim is to preserve optimization performance while improving memory efficiency. In pre-training experiments on Llama-architecture models from 60M to 1B parameters, LoRA-Pre is reported to achieve the highest performance among the compared methods and to show superior rank efficiency, using only 1/8 the rank of baseline methods. In fine-tuning, LoRA-Pre is reported to consistently outperform standard LoRA, with improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B.
Key takeaway
For AI engineers and research scientists working with large language models, LoRA-Pre targets optimizer memory during both pre-training and fine-tuning. The reported results suggest comparable or better performance at substantially lower optimizer-state memory, which could allow training larger models or making fuller use of existing hardware. Consider evaluating LoRA-Pre, available at https://github.com/mrflogs/LoRA-Pre, in your LLM training workflows.
Key insights
LoRA-Pre reduces optimizer memory overhead by applying low-rank approximation to momentum states in large model training.
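To make the memory claim concrete, here is a back-of-the-envelope count under assumed numbers. The 4096 x 4096 layer shape and rank 128 are illustrative choices, not figures from the article, and `state_elements` is a hypothetical helper. It compares one full momentum matrix with a pair of low-rank factors of shapes (rows, rank) and (rank, cols).

```python
def state_elements(rows, cols, rank=None):
    """Number of stored elements for one momentum state of a (rows x cols) weight."""
    if rank is None:
        return rows * cols           # full momentum matrix
    return rank * (rows + cols)      # two factors: (rows x rank) and (rank x cols)


# Assumed, illustrative sizes (not taken from the article).
full = state_elements(4096, 4096)
low_rank = state_elements(4096, 4096, rank=128)
print(full, low_rank, full / low_rank)
```

Under these assumed sizes the factored state holds 16 times fewer elements, and the saving grows as the rank shrinks relative to the layer dimensions.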
Principles
- EMA can be modeled as online linear regression.
- Low-rank decomposition improves memory efficiency.
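The first principle can be checked numerically. A minimal sketch, assuming the standard EMA form m_new = beta * m + (1 - beta) * g: this update coincides with one gradient step of size (1 - beta) on the loss 0.5 * ||m - g||^2, which is the sense in which the EMA behaves like an online regressor tracking incoming gradients. The variable names, shapes, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9
m = rng.standard_normal((4, 3))   # current momentum (illustrative shape)
g = rng.standard_normal((4, 3))   # newly observed gradient

# Standard EMA update.
m_ema = beta * m + (1.0 - beta) * g

# One gradient step on L(m) = 0.5 * ||m - g||_F^2, whose gradient is (m - g),
# taken with step size (1 - beta).
m_gd = m - (1.0 - beta) * (m - g)

print(np.allclose(m_ema, m_gd))
```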
Method
LoRA-Pre first reframes the momentum EMA as a linear regressor trained by online gradient flow, then decomposes the full momentum matrix inside that online linear learner into a compact low-rank subspace.
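One way to picture the method is the simplified sketch below. It is not the update rule from the paper or the LoRA-Pre repository: it assumes the momentum is stored as a product `B @ A` and applies plain gradient steps on the regression loss 0.5 * ||B @ A - g||^2 to each factor. The function name `low_rank_momentum_step`, the step size, the shapes, and the fixed synthetic gradient are all hypothetical choices for illustration.

```python
import numpy as np


def low_rank_momentum_step(B, A, g, lr):
    """One gradient step on 0.5 * ||B @ A - g||_F^2 with respect to B and A."""
    residual = B @ A - g          # (rows x cols), formed only transiently
    grad_B = residual @ A.T       # (rows x rank)
    grad_A = B.T @ residual       # (rank x cols)
    return B - lr * grad_B, A - lr * grad_A


def loss(B, A, g):
    """Regression loss between the factored momentum and the gradient."""
    return 0.5 * np.linalg.norm(B @ A - g) ** 2


rng = np.random.default_rng(0)
rows, cols, rank = 8, 6, 2

# Synthetic rank-2 "gradient", rescaled so its largest singular value is 1.
g = rng.standard_normal((rows, rank)) @ rng.standard_normal((rank, cols))
g /= np.linalg.norm(g, 2)

# Small random initialization of the two persistent factors.
B = 0.1 * rng.standard_normal((rows, rank))
A = 0.1 * rng.standard_normal((rank, cols))

initial_loss = loss(B, A, g)
for _ in range(200):
    B, A = low_rank_momentum_step(B, A, g, lr=0.1)
final_loss = loss(B, A, g)
print(initial_loss, final_loss)
```

In this sketch only `B` and `A` persist between steps, so the stored state scales with rank * (rows + cols) rather than rows * cols. In real training the gradient changes at every step, so the factors track a moving target instead of converging to a fixed matrix.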
In practice
- Pre-train Llama models from 60M to 1B parameters.
- Fine-tune Llama-3.1-8B and Llama-2-7B models.
Topics
- Low-Rank Optimizers
- Large Language Models
- Memory Efficiency
- Pre-training
- Fine-tuning
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.