The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
Summary
A new self-supervised reinforcement learning (RL) framework addresses the underperformance of Large Reasoning Models (LRMs) on spatial reasoning tasks. Unlike existing approaches that rely on supervised fine-tuning (SFT) with external data, this work posits that LRMs already possess spatial reasoning capabilities, which only need to be aligned by enforcing logical coherence under 2D and 3D geometric constraints. The proposed framework uses "consistency verifiers" as reward functions. These verifiers check that a model's answers stay geometrically and semantically consistent when the input is transformed, for example by flipping the image or by reordering the objects named in the question. The authors also introduce OT-GRPO, an optimal transport-based RL strategy formulated as a minimal-matching variant of group relative policy optimization (GRPO). This label-free consistency training reaches accuracy comparable to models trained with ground-truth supervision and generalizes similarly across diverse tasks and data domains.
Key takeaway
For Machine Learning Engineers developing Large Reasoning Models: if limited labeled data is holding back spatial reasoning performance, consider self-supervised consistency training. This approach uses geometric and semantic consistency verifiers under input transformations to align capabilities the LRM already has. The authors report accuracy comparable to supervised methods without costly ground-truth annotations, which removes the labeling step from the training pipeline.
Key insights
Spatial reasoning in LRMs can be amplified by self-supervised consistency training without ground-truth labels.
Principles
- LRMs possess latent spatial reasoning.
- Consistency under transformations aligns reasoning.
- Label-free training can match supervised accuracy.
Method
A self-supervised RL framework, trained with OT-GRPO, uses consistency verifiers as reward functions. The verifiers check geometric and semantic consistency of the model's answers under image and text transformations.
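To make the idea of a consistency verifier concrete, here is a minimal sketch of a label-free reward for a horizontal image flip. It is an illustration only, not the authors' implementation: the answer vocabulary, `FLIP_MAP`, and both function names are assumptions introduced for this example.

```python
# Hypothetical sketch: a label-free consistency reward for a horizontal flip.
# The answer vocabulary and all names here are illustrative assumptions.

# How each answer should change when the image is mirrored left-to-right.
FLIP_MAP = {
    "left": "right",
    "right": "left",
    "above": "above",
    "below": "below",
}


def expected_after_flip(answer: str) -> str:
    """Return the answer that is geometrically consistent after the flip."""
    return FLIP_MAP[answer]


def flip_consistency_reward(answer_original: str, answer_flipped: str) -> float:
    """Reward 1.0 when the two answers agree under the flip, else 0.0.

    No ground-truth label is used: the reward only compares the model's
    own answers on the original image and on its mirrored copy.
    """
    a = answer_original.strip().lower()
    b = answer_flipped.strip().lower()
    if a not in FLIP_MAP or b not in FLIP_MAP:
        return 0.0  # answers outside the vocabulary earn no reward
    return 1.0 if expected_after_flip(a) == b else 0.0
```

A model that answers "left" on the original and "right" on the mirrored image earns the reward. Note that this checks coherence, not correctness: a model can be consistently wrong and still score 1.0. A semantic verifier for reordered questions would presumably follow the same pattern with a different mapping.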
In practice
- Apply image transformations like flipping.
- Use textual transformations, e.g., object reordering.
- Implement OT-GRPO for pairwise verifiers.
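The summary does not describe how OT-GRPO works internally, so the following is not OT-GRPO itself. It is a rough sketch of the two ingredients the summary names: a minimal-cost matching between answers sampled for an original input and for its transformed copy, and GRPO-style group-relative advantages. The brute-force matching, the 0/1 cost, and every name below are assumptions made for illustration.

```python
from itertools import permutations
from statistics import mean, pstdev

# Assumed answer vocabulary and its behavior under a horizontal flip.
FLIP_MAP = {"left": "right", "right": "left", "above": "above", "below": "below"}


def pair_cost(answer_original, answer_flipped):
    """0.0 if the pair is consistent under a horizontal flip, else 1.0."""
    return 0.0 if FLIP_MAP.get(answer_original) == answer_flipped else 1.0


def min_cost_matching(group_a, group_b):
    """Brute-force one-to-one matching with minimal total cost.

    Only practical for small, equally sized groups; a real system would
    use a proper assignment or optimal-transport solver.
    """
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(group_b))):
        cost = sum(pair_cost(group_a[i], group_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Answers sampled for the original image and for its mirrored copy.
original = ["left", "left", "right", "left"]
flipped = ["right", "left", "right", "left"]

matching, total_cost = min_cost_matching(original, flipped)
# One reward per original rollout: 1.0 if its matched partner is consistent.
rewards = [1.0 - pair_cost(original[i], flipped[j]) for i, j in enumerate(matching)]
advantages = group_relative_advantages(rewards)
```

In this toy group, three of the four original answers can be paired with a consistent partner, so one rollout receives a negative advantage and the rest a positive one. How the actual method assigns credit across the pair is not stated in the summary.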
Topics
- Large Reasoning Models
- Spatial Reasoning
- Self-supervised Learning
- Reinforcement Learning
- Consistency Verifiers
- OT-GRPO
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.