On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new diagnostic framework reveals a consistent perception-reasoning asymmetry in frontier Vision-Language Models (VLMs) during post-training: reasoning improves significantly more than perception, creating a bottleneck for end-to-end visual reasoning. In supervised fine-tuning (SFT), the imbalance arises because perception occupies fewer tokens in chain-of-thought supervision and therefore receives a weaker training signal. Dynamically reweighting the loss mitigates this and boosts end-to-end performance by up to 18.2%. In reinforcement learning (RL), the asymmetry arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception. Adding a perception-aware reward improves end-to-end accuracy by up to 6.0%, and a reliable surrogate reward still yields gains of 3.2 points.

Key takeaway

Machine learning engineers optimizing Vision-Language Models should address the perception-reasoning asymmetry during post-training. With supervised fine-tuning, reweighting the loss can boost end-to-end performance by up to 18.2%. With reinforcement learning, adding a perception-aware reward can improve accuracy by up to 6.0%, and a reliable surrogate reward still yields gains of 3.2 points. Both interventions aim to balance perception and reasoning.

Key insights

VLM post-training creates a perception-reasoning asymmetry due to token imbalance (SFT) or reward coupling (RL), hindering end-to-end performance.
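
To make the token-imbalance point concrete, here is a minimal sketch of one possible reweighting. It is an illustration only, not the scheme from the original article: the function names, the `is_perception` mask, and the fixed `perception_share` parameter are all assumptions introduced for this example. It gives the perception and reasoning segments of a chain-of-thought a chosen share of the total loss weight, regardless of how many tokens each segment contains.

```python
from typing import List


def token_weights(is_perception: List[bool], perception_share: float = 0.5) -> List[float]:
    """Per-token loss weights that sum to 1.

    Hypothetical scheme: perception tokens jointly receive `perception_share`
    of the total weight and reasoning tokens receive the rest, no matter how
    many tokens each segment has.
    """
    n_total = len(is_perception)
    if n_total == 0:
        raise ValueError("need at least one token")
    n_perc = sum(is_perception)
    n_reas = n_total - n_perc
    if n_perc == 0 or n_reas == 0:
        # Only one segment present: fall back to a plain per-token mean.
        return [1.0 / n_total] * n_total
    w_perc = perception_share / n_perc
    w_reas = (1.0 - perception_share) / n_reas
    return [w_perc if p else w_reas for p in is_perception]


def weighted_loss(token_losses: List[float], weights: List[float]) -> float:
    """Weighted sum of per-token losses (for example, per-token cross-entropy)."""
    if len(token_losses) != len(weights):
        raise ValueError("token_losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, token_losses))
```

With 2 perception tokens and 8 reasoning tokens, a plain per-token mean gives perception 20% of the total weight; the sketch above raises that to 50% by default. A dynamic variant would adjust `perception_share` during training rather than fixing it.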

Principles

Method

For SFT, dynamically reweight the loss; for RL, add a perception-aware or surrogate reward. Both aim to balance the training signal between perception and reasoning.
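
The RL half of the method can be sketched in the same spirit. The function below is a hypothetical illustration, not the reward used in the original article: `perception_score` is assumed to come from some external check of the model's visual description (a judge or a surrogate metric), and `perception_weight` is an assumed mixing coefficient. The point is only that an added perception term lets two rollouts with the same final answer receive different rewards.

```python
def combined_reward(
    answer_correct: bool,
    perception_score: float,
    perception_weight: float = 0.5,
) -> float:
    """Outcome reward plus a weighted perception term.

    `perception_score` is assumed to lie in [0, 1] and to measure how well
    the rollout's visual description matches the image (hypothetical input).
    """
    if not 0.0 <= perception_score <= 1.0:
        raise ValueError("perception_score must be in [0, 1]")
    outcome = 1.0 if answer_correct else 0.0
    return outcome + perception_weight * perception_score
```

Under an outcome-only reward, a rollout that reaches the right answer from a wrong reading of the image scores the same as one that perceived the image correctly. With the extra term, the second rollout scores higher, so perception receives its own training signal.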

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.