Weight-Space Geometry of Offline Reasoning Training

2026-06-21 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of the weight-space geometry of offline reasoning training asks whether several reinforcement learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) and Supervised Fine-Tuning (SFT) produce distinct or similar weight updates. The researchers trained all six methods on identical math rollouts from a Qwen3-4B base model with attention-only LoRA, then compared the resulting weight deltas using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA. SFT, RFT, and RIFT produced nearly collinear weight deltas (cosine >= 0.97) and comparable GSM8K accuracy (87-88%). DFT diverged more in direction, while Offline GRPO added a substantial component orthogonal to the SFT direction (~67% globally). DPO occupied a near-orthogonal subspace, showed a mode-connectivity barrier, and achieved the highest accuracy on GSM8K (93.5%) and AIME26 (30.0%), though it was trained with a 10x smaller learning rate.
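
The cosine and orthogonal-component figures above can be illustrated with a minimal NumPy sketch. It assumes each checkpoint is available as a dict of arrays keyed by parameter name (a hypothetical layout, not the study's code). It also reads "orthogonal component" as the share of a delta's norm that remains after projecting onto the SFT direction, which is one plausible interpretation of the ~67% figure rather than a definition confirmed by the source.

```python
import numpy as np

def flatten_delta(base, tuned):
    """Concatenate (tuned - base) over all parameters into one flat vector.

    `base` and `tuned` are hypothetical dicts: parameter name -> ndarray.
    """
    return np.concatenate([(tuned[k] - base[k]).ravel() for k in sorted(base)])

def cosine(u, v):
    """Cosine similarity between two flattened weight deltas."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def orthogonal_fraction(delta, reference):
    """Share of delta's norm that is orthogonal to the reference direction.

    Assumed definition: ||delta - proj_reference(delta)|| / ||delta||.
    """
    r = reference / np.linalg.norm(reference)
    residual = delta - (delta @ r) * r
    return float(np.linalg.norm(residual) / np.linalg.norm(delta))

# Toy example with made-up 2x2 weights, not values from the study.
base = {"w": np.zeros((2, 2))}
sft = {"w": np.array([[1.0, 0.0], [0.0, 0.0]])}
other = {"w": np.array([[3.0, 4.0], [0.0, 0.0]])}

d_sft = flatten_delta(base, sft)
d_other = flatten_delta(base, other)
print(cosine(d_sft, d_other), orthogonal_fraction(d_other, d_sft))
```

Under this reading the orthogonal fraction equals the absolute sine of the angle between the two deltas, so a cosine of 0.97 corresponds to an orthogonal fraction of roughly 0.24.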

Key takeaway

Machine learning engineers choosing an offline reasoning training method should note that DPO reached the highest accuracy in this study (93.5% on GSM8K) while updating weights in a near-orthogonal subspace. If the goal is maximum reasoning performance, DPO is the first candidate, but its results were obtained with a 10x smaller learning rate, so the accuracy gap may partly reflect that setting rather than the loss alone. Tune DPO's learning rate carefully, and expect its update dynamics to differ from those of collinear methods such as SFT and RFT.

Key insights

Offline reasoning training methods differ in weight-space update geometry: SFT, RFT, and RIFT move in nearly the same direction, while DPO occupies a near-orthogonal subspace and reached the highest accuracy.

Principles

Weight update geometry varies substantially across offline training losses.
DPO was trained with a 10x smaller learning rate, which may shape both its update and its accuracy.
Collinear weight deltas (cosine >= 0.97) coincided with similar downstream accuracy (87-88% on GSM8K).

Method

The study compares SFT, RFT, DFT, RIFT, Offline GRPO, and DPO by analyzing weight deltas using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA.
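
Two of the analyses named above, principal-angle subspace analysis and CKA, can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the study's implementation: the choice of the top-k left singular vectors as each delta's subspace, the rank k, and the activation matrices passed to CKA are all hypothetical, and the CKA variant shown is the linear one.

```python
import numpy as np

def top_k_basis(delta, k):
    """Orthonormal basis of the top-k left singular subspace of a 2-D weight delta."""
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k]

def principal_angles(delta_a, delta_b, k):
    """Principal angles in radians, smallest first, between two top-k subspaces."""
    qa, qb = top_k_basis(delta_a, k), top_k_basis(delta_b, k)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def linear_cka(x, y):
    """Linear CKA between activations x (n, p) and y (n, q) on the same n inputs."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature column
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    self_x = np.linalg.norm(x.T @ x, "fro")
    self_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (self_x * self_y))

# Toy rank-1 deltas with made-up values: orthogonal column spaces.
a = np.outer([1.0, 0.0, 0.0], [1.0, 2.0])
b = np.outer([0.0, 1.0, 0.0], [1.0, 1.0])
print(principal_angles(a, b, k=1))  # angle between the two rank-1 subspaces
```

Angles near zero indicate that two methods update the same subspace, and angles near pi/2 indicate near-orthogonal updates. For a LoRA adapter, the delta passed in would typically be the merged update, the product of the two low-rank factors times the LoRA scaling.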

In practice

Prioritize DPO for the highest reasoning accuracy, but tune its learning rate (the study used one 10x smaller).
Compare weight deltas (cosine similarity, subspace angles, mode connectivity) to see how methods differ.
SFT, RFT, and RIFT offer similar performance and share update directions.
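
One way to probe whether two fine-tuned checkpoints are linearly mode connected is to evaluate a loss along the straight line between them. The sketch below is a rough illustration: `loss_fn` is a hypothetical callable that maps a dict of weights to a scalar loss, the number of interpolation points is arbitrary, and the barrier definition (largest excess over the linear interpolation of the endpoint losses) is one common convention, not necessarily the one used in the study.

```python
import numpy as np

def interpolate(theta_a, theta_b, t):
    """Weights at position t on the line from theta_a (t=0) to theta_b (t=1)."""
    return {k: (1.0 - t) * theta_a[k] + t * theta_b[k] for k in theta_a}

def barrier(theta_a, theta_b, loss_fn, num_points=11):
    """Largest excess of the path loss over the interpolated endpoint losses."""
    ts = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn(interpolate(theta_a, theta_b, t)) for t in ts])
    baseline = (1.0 - ts) * losses[0] + ts * losses[-1]
    return float(np.max(losses - baseline))

# Toy double-well loss with minima at w = -1 and w = +1 (made-up example).
def double_well(theta):
    return float(np.sum((theta["w"] ** 2 - 1.0) ** 2))

left = {"w": np.array([-1.0])}
right = {"w": np.array([1.0])}
print(barrier(left, right, double_well))  # positive: the two minima are separated
```

A barrier near zero suggests the two checkpoints sit in the same loss basin, while a clearly positive barrier, as reported for DPO, suggests they do not. Interpolating merged weights is not the same as interpolating LoRA factors separately, and this summary does not state which the study used.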

Topics

Offline Reinforcement Learning
Direct Preference Optimization
Weight-Space Geometry
Model Fine-Tuning
Qwen3-4B
LoRA

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.