Weight-Space Geometry of Offline Reasoning Training
Summary
A study of the weight-space geometry of offline reasoning training asks whether Supervised Fine-Tuning (SFT) and several reinforcement learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) produce distinct or similar weight updates. The researchers trained all six methods on identical math rollouts from a Qwen3-4B base model with attention-only LoRA, then compared the resulting weight deltas using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA. SFT, RFT, and RIFT produced nearly colinear weight deltas (cosine >= 0.97) and comparable GSM8K accuracy (87-88%). DFT diverged more in direction, and Offline GRPO added a substantial component orthogonal to the SFT direction (~67% globally). DPO occupied a near-orthogonal subspace, showed a mode-connectivity barrier, and achieved the highest accuracy on GSM8K (93.5%) and AIME26 (30.0%). Its training used a 10x smaller learning rate, however, so the comparison is not learning-rate matched.
Key takeaway
If you are a machine learning engineer choosing an offline reasoning training method, note that DPO reached the highest accuracy (93.5% on GSM8K, versus 87-88% for SFT, RFT, and RIFT) while updating the weights in a near-orthogonal subspace. DPO is the strongest candidate when reasoning accuracy is the priority, but its run used a 10x smaller learning rate, so the gain cannot be attributed to the loss alone. Tune DPO's learning rate carefully, and expect its update dynamics to differ from those of near-colinear methods such as SFT and RFT.
Key insights
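Offline reasoning training methods do not all move the weights the same way: SFT, RFT, and RIFT share nearly one update direction, DFT and Offline GRPO depart from it, and DPO occupies a near-orthogonal subspace while reaching the highest accuracy.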
Principles
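- Weight-update geometry depends on the loss: some offline losses nearly coincide with SFT, while others diverge sharply.
- DPO's run used a 10x smaller learning rate, so learning rate is a confound when interpreting its update and accuracy.
- Nearly colinear weight deltas went together with similar downstream accuracy (87-88% on GSM8K).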
Method
The study trains SFT, RFT, DFT, RIFT, Offline GRPO, and DPO on identical math rollouts from a Qwen3-4B base model with attention-only LoRA, then compares the resulting weight deltas using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA.
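Two of these measurements can be sketched for a single weight matrix as follows. This is a minimal illustration on synthetic data, not the study's code: the function names, the random stand-in deltas, and the choice of k are all assumptions. In particular, the summary does not say how the "~67%" orthogonal component was computed, so the squared-norm definition in `orthogonal_fraction` is only one plausible reading.

```python
import numpy as np


def flat_cosine(delta_a, delta_b):
    """Cosine similarity between two weight deltas, flattened to vectors."""
    a, b = np.ravel(delta_a), np.ravel(delta_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def orthogonal_fraction(delta, reference):
    """Share of delta's squared norm orthogonal to reference (assumed definition)."""
    c = flat_cosine(delta, reference)
    return 1.0 - c * c


def principal_angles(delta_a, delta_b, k):
    """Principal angles in radians between the top-k left singular subspaces."""
    u_a = np.linalg.svd(delta_a, full_matrices=False)[0][:, :k]
    u_b = np.linalg.svd(delta_b, full_matrices=False)[0][:, :k]
    # Singular values of U_a^T U_b are the cosines of the principal angles.
    cosines = np.linalg.svd(u_a.T @ u_b, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))


# Synthetic stand-ins for (fine-tuned weight - base weight) of one matrix.
rng = np.random.default_rng(0)
delta_sft = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))
delta_other = delta_sft + 0.1 * rng.normal(size=(64, 64))

print(flat_cosine(delta_sft, delta_other))
print(orthogonal_fraction(delta_other, delta_sft))
print(principal_angles(delta_sft, delta_other, k=4))
```

A cosine near 1 with small principal angles corresponds to the SFT/RFT/RIFT pattern described in the summary, while angles near pi/2 correspond to the near-orthogonal subspace reported for DPO.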
In practice
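- Prioritize DPO when reasoning accuracy matters most, and tune its learning rate, since the reported run used a 10x smaller one.
- Compare weight deltas (cosine similarity, subspace angles, mode connectivity) to see whether two methods actually differ.
- Expect SFT, RFT, and RIFT to perform similarly, since their updates share nearly the same direction.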
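The mode-connectivity barrier mentioned in the summary can be checked with a sketch like the one below. It assumes the two checkpoints are available as flat parameter vectors and that `loss_fn` is a user-supplied evaluation function. Both are hypothetical here, and a toy double-well loss stands in for a real model evaluation. The barrier is taken as the largest amount by which the loss along the straight path exceeds the linear interpolation of the endpoint losses, which is one common convention and not necessarily the one the study used.

```python
import numpy as np


def interpolate(theta_a, theta_b, t):
    """Point on the straight line between two parameter vectors."""
    return (1.0 - t) * theta_a + t * theta_b


def barrier_height(loss_fn, theta_a, theta_b, num_points=11):
    """Largest excess of path loss over the linear blend of endpoint losses."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn(interpolate(theta_a, theta_b, t)) for t in ts])
    baseline = (1.0 - ts) * losses[0] + ts * losses[-1]
    return float(np.max(losses - baseline))


def double_well(theta):
    """Toy loss with minima at -1 and +1 in every coordinate (illustration only)."""
    return float(np.sum((theta ** 2 - 1.0) ** 2))


theta_a = -np.ones(3)
theta_b = np.ones(3)
print(barrier_height(double_well, theta_a, theta_b))
```

A barrier near zero means the two solutions sit in one connected low-loss region along that line; a clearly positive barrier, as reported for DPO, means the straight path between the checkpoints passes through worse models.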
Topics
- Offline Reinforcement Learning
- Direct Preference Optimization
- Weight-Space Geometry
- Model Fine-Tuning
- Qwen3-4B
- LoRA
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.