Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation in RLVR
Summary
A new study introduces Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation (LoRA) in Reinforcement Learning with Verifiable Rewards (RLVR). It targets the underperformance and instability that existing LoRA variants such as PiSSA and MiLoRA show in RLVR settings: these variants excel in supervised fine-tuning (SFT), but their efficacy in RLVR has been unclear. Through theoretical analysis, the authors show that orthonormal initialization minimizes the performance gap between LoRA and full fine-tuning in RLVR. This result motivates two new LoRA variants, RLPO and RLMO, both built on geometry-preserving orthonormal initialization. Experiments on mathematical reasoning benchmarks confirm that the proposed initialization stabilizes RLVR training and consistently outperforms standard LoRA, whereas PiSSA and MiLoRA do not. The same analysis gives a unified explanation for why PiSSA and MiLoRA underperform in RLVR. Code and checkpoints are publicly available.
Key takeaway
Machine learning engineers who fine-tune large language models with Low-Rank Adaptation (LoRA) under Reinforcement Learning with Verifiable Rewards (RLVR) should re-evaluate their initialization strategy. Variants tuned for SFT, such as PiSSA and MiLoRA, can underperform and destabilize RLVR training. Consider geometry-preserving orthonormal initialization instead, as implemented in the proposed RLPO and RLMO variants, which the study reports trains more stably and outperforms standard LoRA on mathematical reasoning benchmarks.
Key insights
Orthonormal initialization stabilizes Low-Rank Adaptation (LoRA) training in Reinforcement Learning with Verifiable Rewards (RLVR) and consistently outperforms standard LoRA on mathematical reasoning benchmarks.
Principles
- Orthonormal initialization minimizes LoRA's performance gap to full fine-tuning in RLVR.
- LoRA variants optimized for SFT may underperform or destabilize RLVR training.
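To make the word "orthonormal" in these principles concrete, the following sketch compares two ways of building a pair of LoRA factors from the SVD of a weight matrix. It is an illustration rather than the paper's code: the matrix sizes, the rank `r`, and the helper `gram_deviation` are assumptions, and the singular-value-scaled factors are only a PiSSA-style stand-in for an SFT-oriented initialization.

```python
import numpy as np

def gram_deviation(B, A):
    """Frobenius distance of B^T B and A A^T from the identity (0 means orthonormal)."""
    eye = np.eye(B.shape[1])
    return np.linalg.norm(B.T @ B - eye), np.linalg.norm(A @ A.T - eye)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # stand-in for a pretrained weight (assumption)
r = 4                               # hypothetical LoRA rank

U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Orthonormal factors: singular vectors only, no singular-value scaling.
B_orth, A_orth = U[:, :r], Vt[:r, :]

# Scaled factors (PiSSA-style stand-in): each factor absorbs sqrt(singular value).
sqrt_S = np.sqrt(S[:r])
B_scaled = U[:, :r] * sqrt_S
A_scaled = sqrt_S[:, None] * Vt[:r, :]

orth_dev = gram_deviation(B_orth, A_orth)
scaled_dev = gram_deviation(B_scaled, A_scaled)
print("orthonormal factors:", orth_dev)
print("scaled factors:     ", scaled_dev)
```

The orthonormal pair has Gram matrices equal to the identity up to floating-point error, while the scaled pair carries the magnitude of the selected singular values. Whether that scale difference is the mechanism behind the reported RLVR behavior is a question for the paper's analysis; this sketch does not establish it.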
Method
A theoretical analysis guides the development of geometry-preserving orthonormal initialization, leading to new LoRA variants, RLPO and RLMO, specifically for RLVR.
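The summary does not spell out how RLPO and RLMO are constructed, so the sketch below shows only one plausible reading of an orthonormal LoRA initialization. Everything specific in it is an assumption: the function name `orthonormal_lora_init`, the `which` argument (principal or minor singular directions, mirroring the choices PiSSA and MiLoRA make), and the step that subtracts the adapter from the frozen weight so the model's initial output is unchanged.

```python
import numpy as np

def orthonormal_lora_init(W, r, which="principal"):
    """Hypothetical helper: return (W_res, B, A) such that B has orthonormal
    columns, A has orthonormal rows, and W_res + B @ A reproduces W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = S.shape[0]
    if not 0 < r <= k:
        raise ValueError(f"rank r must be in 1..{k}, got {r}")
    if which == "principal":
        idx = slice(0, r)          # directions with the largest singular values
    elif which == "minor":
        idx = slice(k - r, k)      # directions with the smallest singular values
    else:
        raise ValueError("which must be 'principal' or 'minor'")
    B = U[:, idx]                  # (out_features, r), orthonormal columns
    A = Vt[idx, :]                 # (r, in_features), orthonormal rows
    W_res = W - B @ A              # frozen residual keeps the initial function intact
    return W_res, B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))  # stand-in for a pretrained weight (assumption)
x = rng.standard_normal(32)

W_res, B, A = orthonormal_lora_init(W, r=4, which="minor")
y_base = W @ x
y_lora = W_res @ x + B @ (A @ x)   # adapted layer at initialization
print(np.allclose(y_base, y_lora))
```

In this sketch only `B` and `A` would be trained while `W_res` stays frozen. Consult the released code for the actual RLPO and RLMO definitions before relying on any of these choices.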
In practice
- Apply geometry-preserving orthonormal initialization to stabilize RLVR training.
- Use RLPO or RLMO variants to achieve superior performance over standard LoRA in RLVR.
Topics
- Low-Rank Adaptation
- Reinforcement Learning
- Orthonormal Initialization
- Parameter-Efficient Fine-Tuning
- Mathematical Reasoning
- Large Language Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.