Reward Design for Physical Reasoning in Vision-Language Models

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

This study ran a systematic reward ablation for Group Relative Policy Optimization (GRPO)-based Vision-Language Model (VLM) training on physical reasoning, using IBM Granite Vision 3.3 (2B) on the PhyX benchmark. The researchers compared four reward signals: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, unit consistency), and a novel internal attention-weight reward. GRPO with accuracy-based rewards generally outperformed Supervised Fine-Tuning (SFT) across most physics domains in both multiple-choice (MCQ) and open-ended (OE) formats. Reward design did not improve performance uniformly; instead, each reward induced domain-specific reasoning behaviors. Accuracy-based rewards provided the strongest overall gains, while rubric rewards improved structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhanced spatial reasoning, raising accuracy from 0.27 to 0.50 on spatial relationship problems, but degraded performance in symbolic domains.

Key takeaway

For computer vision engineers developing VLMs for physical reasoning, the choice of reward signal directly affects model behavior and performance across reasoning types. If your application demands strong spatial reasoning, integrate attention-based rewards, but expect weaker performance in symbolic domains. For overall accuracy, prioritize accuracy-based rewards. Complex rubric rewards may not yield consistent accuracy gains in smaller models because of optimization instability, which suggests a trade-off between reasoning integrity and raw accuracy.

Key insights

Reward design shapes how a VLM reasons about physics: accuracy-based and attention-based signals each drive domain-specific gains rather than uniform improvement.

Principles

Reward complexity can destabilize optimization in smaller models.
Visual grounding and symbolic reasoning may compete for VLM representational resources.

Method

A systematic reward ablation study for GRPO-based VLM training on physical reasoning, comparing four reward signals: format, accuracy, rubric, and a novel internal attention-weight reward, evaluated on the PhyX benchmark.
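The sketch below illustrates how three of the four reward signals could be scored and turned into GRPO training signal. It is a minimal illustration under stated assumptions, not the study's implementation: the `<think>`/`<answer>` template, the keyword-based principle check, the substring unit check, and the rubric weights are all hypothetical. The group-relative step follows the standard GRPO idea of normalizing each sampled completion's reward against the other completions for the same prompt.

```python
import re
from statistics import mean, pstdev


def format_reward(completion: str) -> float:
    """1.0 if the completion matches a hypothetical think/answer template."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, completion, flags=re.DOTALL) else 0.0


def extract_answer(completion: str) -> str:
    """Return the text inside the first <answer> tag, or '' if absent."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 on a case-insensitive exact match with the gold answer."""
    return 1.0 if extract_answer(completion).lower() == gold.strip().lower() else 0.0


def rubric_reward(completion, gold, principle_keywords, unit,
                  weights=(0.5, 0.3, 0.2)):
    """Weighted mix of correctness, principle mention, and unit presence.

    The weights and the keyword/substring checks are assumptions made
    for illustration only.
    """
    text = completion.lower()
    answer = extract_answer(completion).lower()
    correct = accuracy_reward(completion, gold)
    principle = 1.0 if any(k.lower() in text for k in principle_keywords) else 0.0
    units_ok = 1.0 if unit and unit.lower() in answer else 0.0
    w_correct, w_principle, w_unit = weights
    return w_correct * correct + w_principle * principle + w_unit * units_ok


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal. This is one reason a sparse or hard-to-satisfy reward can behave very differently from a dense one under GRPO.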

In practice

Prioritize accuracy-based rewards for overall VLM physical reasoning gains.
Use attention-based rewards to enhance spatial reasoning in VLMs, accepting weaker performance in symbolic domains.
Consider rubric rewards when structured reasoning quality matters more than consistent accuracy gains.
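The summary does not define how the attention-weight reward is computed, so the following is only a hypothetical proxy to make the idea concrete: score a completion by the average share of attention that its generated tokens place on image tokens. The function name, the input layout (one row of attention weights per generated token), and the choice of image-token positions are all assumptions, not the study's formulation.

```python
def image_attention_mass(attn_rows, image_positions):
    """Mean share of attention mass placed on image-token positions.

    attn_rows: one list of non-negative attention weights per generated
        token, indexed by input position (hypothetical layout).
    image_positions: indices of the input positions holding image tokens.
    """
    if not attn_rows:
        return 0.0
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total <= 0:
            shares.append(0.0)
            continue
        on_image = sum(row[i] for i in image_positions if 0 <= i < len(row))
        shares.append(on_image / total)
    return sum(shares) / len(shares)
```

A reward of this shape pushes the policy toward visual grounding regardless of whether the problem needs it, which is consistent with the reported pattern of spatial gains alongside symbolic losses, though the summary does not establish that mechanism.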

Topics

Reward Design
Vision-Language Models
Physical Reasoning
Group Relative Policy Optimization
PhyX Benchmark

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.