Reward Design for Physical Reasoning in Vision-Language Models
Summary
This study is a systematic reward ablation for Group Relative Policy Optimization (GRPO)-based Vision-Language Model (VLM) training on physical reasoning, run with IBM Granite Vision 3.3 (2B) on the PhyX benchmark. The researchers compared four reward signals: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, unit consistency), and a novel internal attention-weight reward. GRPO with accuracy-based rewards generally outperformed Supervised Fine-Tuning (SFT) across most physics domains in both multiple-choice (MCQ) and open-ended (OE) formats. Reward design did not improve performance uniformly; instead, it induced domain-specific reasoning behaviors. Accuracy-based rewards provided the strongest overall gains, while rubric rewards improved structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhanced spatial reasoning, raising accuracy from 0.27 to 0.50 on spatial relationship problems, but degraded performance in symbolic domains.
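To make the first three reward signals concrete, here is a minimal Python sketch of what format, accuracy, and composite rubric rewards could look like. Everything specific in it is an assumption for illustration: the `<think>`/`<answer>` template, the exact-match comparison, the substring check for the physics principle, and the rubric weights are hypothetical and are not taken from the study.

```python
import re

# Hypothetical response template: reasoning in <think>, final answer in <answer>.
# The template actually used in the study is not given in this summary.
FORMAT_RE = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)


def extract_answer(response: str):
    """Return the text inside <answer>...</answer>, or None if the template is violated."""
    match = FORMAT_RE.fullmatch(response.strip())
    return match.group(1).strip() if match else None


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if extract_answer(response) is not None else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the gold answer (case-insensitive), else 0.0."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer.lower() == gold.strip().lower() else 0.0


def rubric_reward(response: str, gold: str, principle: str, unit: str,
                  weights=(0.5, 0.25, 0.25)) -> float:
    """Weighted sum of answer correctness, principle mention, and unit match.

    The weights and the string-based checks are illustrative assumptions.
    """
    answer = extract_answer(response) or ""
    scores = (
        accuracy_reward(response, gold),                    # answer correctness
        1.0 if principle.lower() in response.lower() else 0.0,  # principle named
        1.0 if answer.split()[-1:] == [unit] else 0.0,      # answer ends with the unit
    )
    return sum(w * s for w, s in zip(weights, scores))
```

A response with the right principle and unit but the wrong number earns partial rubric credit under this sketch, which is one way a rubric can reward structured reasoning without rewarding accuracy.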
Key takeaway
For Computer Vision Engineers developing VLMs for physical reasoning, the choice of reward signal directly shapes model behavior and performance across different reasoning types. If your application demands strong spatial reasoning, integrate attention-based rewards. For overall accuracy, prioritize accuracy-based rewards. Be aware that complex rubric rewards may not yield consistent accuracy gains in smaller models because of optimization instability, which points to a trade-off between structured reasoning quality and raw accuracy.
Key insights
Reward design significantly shapes VLM physical reasoning, with accuracy and attention-based signals driving domain-specific performance gains.
Principles
- Reward complexity can destabilize optimization in smaller models.
- Visual grounding and symbolic reasoning may compete for VLM representational resources.
Method
The study runs a systematic reward ablation for GRPO-based VLM training on physical reasoning. It compares four reward signals (format, accuracy, rubric, and a novel internal attention-weight reward) and evaluates each on the PhyX benchmark.
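Whichever reward signal is used, GRPO turns the rewards for a group of responses sampled from the same prompt into group-relative advantages by standardizing them within the group. The sketch below shows only that step; the function name, the `eps` term, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the shape of the reward matters as much as its average value.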
In practice
- Prioritize accuracy-based rewards for overall VLM physical reasoning gains.
- Use attention-based rewards to enhance spatial reasoning in VLMs.
- Consider rubric rewards for improving structured reasoning quality.
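This summary does not describe how the attention-weight reward is computed. Purely as an illustration of the general idea, the sketch below scores the share of attention mass that falls on image tokens. The function, its inputs, and the choice of "share on image tokens" as the signal are all hypothetical and should not be read as the study's method.

```python
def image_attention_share(attn_weights, image_token_mask):
    """Fraction of total attention mass placed on image tokens.

    attn_weights: non-negative attention weights over the input tokens
                  (assumed already averaged over heads and layers).
    image_token_mask: booleans, True where the input token is an image token.
    Hypothetical signal for illustration only.
    """
    total = sum(attn_weights)
    if total <= 0:
        return 0.0
    on_image = sum(w for w, is_img in zip(attn_weights, image_token_mask) if is_img)
    return on_image / total
```

A signal of this kind pushes the model to look at the image more, which is consistent with gains on spatial problems and with possible losses where the answer depends mainly on symbolic manipulation.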
Topics
- Reward Design
- Vision-Language Models
- Physical Reasoning
- Group Relative Policy Optimization
- PhyX Benchmark
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.