Reward Design for Physical Reasoning in Vision-Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

This systematic reward ablation study examines Group Relative Policy Optimization (GRPO)-based training of Vision-Language Models (VLMs) on physical reasoning, using IBM Granite Vision 3.3 (2B) on the PhyX benchmark. The researchers compared four reward signals: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, unit consistency), and a novel internal attention-weight reward. GRPO with accuracy-based rewards generally outperformed Supervised Fine-Tuning (SFT) across most physics domains in both multiple-choice (MCQ) and open-ended (OE) formats. Reward design did not improve performance uniformly; instead, it induced domain-specific reasoning behaviors. Accuracy-based rewards gave the strongest overall gains, while rubric rewards improved the quality of structured reasoning without consistent accuracy improvements. Attention-based rewards enhanced spatial reasoning, raising accuracy from 0.27 to 0.50 on spatial relationship problems, but degraded performance in symbolic domains.
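To make the GRPO mechanism concrete: each prompt gets a group of sampled responses, and each response's reward is scored relative to the rest of its group rather than by a learned value model. The following is a minimal sketch of that normalization step only, not the study's training code. The function name, the epsilon value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own sampling group.

    Assumption: population standard deviation and a small eps for
    numerical stability; real implementations may differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one physics question,
# scored with a binary accuracy reward (1.0 = correct, 0.0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers receive positive advantages, wrong ones negative.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

One consequence of this design is that a group in which every response earns the same reward produces all-zero advantages and therefore no learning signal, which is one reason the choice of reward matters so much.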

Key takeaway

For computer vision engineers developing VLMs for physical reasoning, the choice of reward signal directly shapes model behavior and performance across reasoning types. If the application demands strong spatial reasoning, consider attention-based rewards, keeping in mind that they degraded performance in symbolic domains. For overall accuracy, prioritize accuracy-based rewards. Complex rubric rewards may not yield consistent accuracy gains in smaller models because of optimization instability, which suggests a trade-off between reasoning integrity and raw accuracy.

Key insights

Reward design significantly shapes VLM physical reasoning, with accuracy and attention-based signals driving domain-specific performance gains.

Principles

Method

A systematic reward ablation study for GRPO-based VLM training on physical reasoning, comparing four reward signals: format, accuracy, rubric, and a novel internal attention-weight reward, evaluated on the PhyX benchmark.
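Three of the four reward signals can be sketched as simple scoring functions over a model response. The code below is an illustrative sketch, not the paper's implementation: the `<think>`/`<answer>` tag format, the exact-match comparison, the keyword check for the physics principle, the suffix check for units, and the rubric weights are all assumptions. The attention-weight reward is omitted because it depends on model internals that the summary does not specify.

```python
import re

# Assumed response template: reasoning in <think>, final answer in <answer>.
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response):
    """Return the text inside <answer> tags, or None if absent."""
    match = ANSWER_RE.search(response)
    return match.group(1).strip() if match else None


def format_reward(response):
    """1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(response) else 0.0


def accuracy_reward(response, gold):
    """1.0 if the extracted answer matches the reference (case-insensitive)."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == gold.strip().lower() else 0.0


def rubric_reward(response, gold, principle, unit, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of correctness, principle mention, and unit consistency.

    The weights and the string-based checks are hypothetical placeholders.
    """
    answer = extract_answer(response) or ""
    correct = accuracy_reward(response, gold)
    names_principle = 1.0 if principle.lower() in response.lower() else 0.0
    has_unit = 1.0 if answer.endswith(unit) else 0.0
    w_correct, w_principle, w_unit = weights
    return w_correct * correct + w_principle * names_principle + w_unit * has_unit


good = "<think>Newton's second law: a = F/m = 10/2.</think><answer>5 m/s^2</answer>"
wrong = "<think>Newton's second law.</think><answer>20 m/s^2</answer>"

for name, resp in [("good", good), ("wrong", wrong)]:
    print(
        name,
        format_reward(resp),
        accuracy_reward(resp, "5 m/s^2"),
        rubric_reward(resp, "5 m/s^2", "newton's second law", "m/s^2"),
    )
```

In this sketch the wrong response still earns partial rubric credit for naming the right principle and using consistent units. That mirrors the reported pattern in which rubric rewards improve structured reasoning without necessarily improving accuracy.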

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.