Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

2026-06-17 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ViGOS is a visually grounded On-Policy Self-Distillation (OPSD) framework for post-training Multimodal Large Language Models (MLLMs) that aims to mitigate shortcut learning. OPSD is effective for LLM reasoning, but when it is extended directly to MLLMs it can lead the model to rely on text reference targets rather than on the visual input. ViGOS addresses this by decoupling perception and reasoning. In its two-stage process, the student MLLM first generates a visual description, then reasons towards the final answer. For valid rollouts, an image-only perception teacher supervises the visual description, while a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only for invalid rollouts, to maintain output format. This approach preserves OPSD's benefits while enhancing image-grounded behavior in shortcut-prone scenarios across benchmarks covering general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior tasks.

Key takeaway

Machine Learning Engineers developing or fine-tuning Multimodal Large Language Models should consider a post-training setup that decouples perception from reasoning. This approach, exemplified by ViGOS, targets the risk of MLLMs relying on text shortcuts instead of visual input. Supervising the visual description and the reasoning stage with separate, specialized teachers is intended to improve image-grounded behavior and robustness on shortcut-prone multimodal tasks.

Key insights

Decoupling perception and reasoning in MLLMs mitigates shortcut learning by keeping the visual description grounded in the image.

Principles

Extending OPSD directly to MLLMs can create text-bias shortcuts.
Separate teachers can enforce visual grounding and reasoning.

Method

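A student MLLM first describes the image, then reasons towards the final answer. For valid rollouts, an image-only perception teacher supervises the description, and a privileged reasoning teacher supervises the reasoning and final answer. A reference teacher is used only for invalid rollouts, to maintain output format.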
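
The teacher routing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the tag-based output template, the names `ROLLOUT_PATTERN`, `Assignment`, and `route_rollout`, and the rule that "valid" means "matches the template" are all assumptions, since the source does not specify the output format or the validity check.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical output template (assumption): the source does not say how
# the description, reasoning, and answer are delimited.
ROLLOUT_PATTERN = re.compile(
    r"^\s*<description>(?P<description>.+?)</description>\s*"
    r"(?P<rest><reasoning>.+?</reasoning>\s*<answer>.+?</answer>)\s*$",
    re.DOTALL,
)


@dataclass
class Assignment:
    teacher: str  # "perception", "reasoning", or "reference"
    text: str     # the span of the rollout that teacher supervises


def route_rollout(rollout: str) -> List[Assignment]:
    """Decide which teacher supervises which part of a student rollout."""
    match = ROLLOUT_PATTERN.match(rollout)
    if match is None:
        # Invalid rollout: only the reference teacher is used,
        # to pull the student back to the expected output format.
        return [Assignment("reference", rollout)]
    return [
        # Stage 1: the image-only teacher supervises the visual description.
        Assignment("perception", match.group("description").strip()),
        # Stage 2: the privileged teacher supervises reasoning and answer.
        Assignment("reasoning", match.group("rest").strip()),
    ]


if __name__ == "__main__":
    sample = (
        "<description>A red cube on a table.</description>"
        "<reasoning>Only one cube is visible.</reasoning>"
        "<answer>1</answer>"
    )
    for assignment in route_rollout(sample):
        print(assignment.teacher, "->", assignment.text)
```

Keeping the routing decision separate from the loss computation makes it easy to swap in a different validity check without touching the training step.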

In practice

Use a two-stage output (visual description first, then reasoning) when post-training MLLMs.
Use distinct teachers for perception and reasoning tasks.
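
One way to picture "distinct teachers" in a training step is a per-token distillation loss in which each token is scored against the teacher that owns its segment. The sketch below is only an illustration under stated assumptions: the source does not name the divergence, so KL(student || teacher) is a placeholder choice, and `kl_divergence`, `decoupled_distill_loss`, and the `is_description` mask are hypothetical names. Distributions are plain Python lists over a toy vocabulary.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def decoupled_distill_loss(
    student: Sequence[Sequence[float]],
    perception_teacher: Sequence[Sequence[float]],
    reasoning_teacher: Sequence[Sequence[float]],
    is_description: Sequence[bool],
) -> float:
    """Mean per-token divergence, choosing the teacher by segment.

    Tokens flagged in `is_description` are compared with the image-only
    perception teacher; all other tokens are compared with the privileged
    reasoning teacher. The divergence itself is an assumption.
    """
    if len(student) == 0:
        raise ValueError("rollout must contain at least one token")
    total = 0.0
    for t, student_dist in enumerate(student):
        teacher = perception_teacher if is_description[t] else reasoning_teacher
        total += kl_divergence(student_dist, teacher[t])
    return total / len(student)


if __name__ == "__main__":
    student = [[0.5, 0.5], [0.5, 0.5]]
    perception = [[0.5, 0.5], [0.9, 0.1]]
    reasoning = [[0.9, 0.1], [0.5, 0.5]]
    # Token 0 belongs to the description, token 1 to the reasoning.
    print(decoupled_distill_loss(student, perception, reasoning, [True, False]))
```

In a real pipeline the lists would be replaced by logits from the framework in use, and the mask would come from the boundary between the description and reasoning segments of each rollout.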

Topics

Multimodal LLMs
Self-Distillation
Visual Grounding
Shortcut Learning
Perception-Reasoning Decoupling
MLLM Post-training

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.