Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Summary
ViGOS is a visually grounded On-Policy Self-Distillation (OPSD) framework for post-training Multimodal Large Language Models (MLLMs) that mitigates shortcut learning. OPSD is effective for LLM reasoning, but when extended directly to MLLMs it can lead the model to rely on the text reference targets rather than on the visual input. ViGOS addresses this by decoupling perception and reasoning into two stages: the student MLLM first generates a visual description, then reasons towards the final answer. For valid rollouts, an image-only perception teacher supervises the visual description, while a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is employed only for invalid rollouts, to maintain output format. This approach preserves OPSD's benefits while enhancing image-grounded behavior in shortcut-prone scenarios across benchmarks covering general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior tasks.
Key takeaway
Machine Learning Engineers developing or fine-tuning Multimodal Large Language Models should consider decoupling perception from reasoning during post-training. This approach, exemplified by ViGOS, directly addresses the risk of MLLMs relying on text shortcuts instead of visual input. Separating the visual description and reasoning stages, each supervised by its own specialized teacher, can improve the model's image-grounded behavior in shortcut-prone multimodal tasks.
Key insights
Decoupling perception and reasoning in MLLMs prevents shortcut learning by ensuring visual grounding.
Principles
- Direct OPSD extension to MLLMs creates text-bias shortcuts.
- Separate teachers can enforce visual grounding and reasoning.
Method
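A student MLLM first describes the image, then reasons towards the final answer. For valid rollouts, an image-only perception teacher supervises the description and a privileged reasoning teacher supervises the reasoning and final answer. A reference teacher is used only for invalid rollouts, to maintain output format.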
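The teacher assignment can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rollout` class, the teacher labels, and the `assign_teachers` function are hypothetical names, and supervising every token of an invalid rollout with the reference teacher is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for the three teachers named in the summary.
PERCEPTION = "perception_teacher"  # conditioned on the image only
REASONING = "reasoning_teacher"    # conditioned on privileged information
REFERENCE = "reference_teacher"    # used only to maintain output format


@dataclass
class Rollout:
    """One student rollout, already split into its two stages."""
    description_tokens: List[str]
    reasoning_tokens: List[str]  # reasoning plus final answer
    is_valid: bool               # whether the rollout follows the expected format


def assign_teachers(rollout: Rollout) -> List[str]:
    """Return, for each token of the rollout, the teacher that supervises it."""
    n_desc = len(rollout.description_tokens)
    n_reason = len(rollout.reasoning_tokens)
    if not rollout.is_valid:
        # Assumption: an invalid rollout is supervised entirely by the
        # reference teacher.
        return [REFERENCE] * (n_desc + n_reason)
    # Valid rollout: perception teacher on the description stage,
    # reasoning teacher on the reasoning and final answer.
    return [PERCEPTION] * n_desc + [REASONING] * n_reason
```

In a training loop, the returned labels would select which teacher's token distribution the student is distilled towards at each position.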
In practice
- Apply two-stage processing for MLLM fine-tuning.
- Use distinct teachers for perception and reasoning tasks.
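One way to apply two-stage processing is to have the student emit its output in a fixed template and to treat any rollout that does not match the template as invalid. The sketch below assumes a tag-based template; the `<description>`, `<reasoning>`, and `<answer>` tags and the `split_rollout` helper are hypothetical and are not taken from the original article.

```python
import re
from typing import Optional, Tuple

# Assumed output template: description first, then reasoning, then answer.
_TEMPLATE = re.compile(
    r"\A\s*<description>(?P<desc>.+?)</description>\s*"
    r"<reasoning>(?P<reason>.+?)</reasoning>\s*"
    r"<answer>(?P<ans>.+?)</answer>\s*\Z",
    re.DOTALL,
)


def split_rollout(text: str) -> Optional[Tuple[str, str, str]]:
    """Split a rollout into (description, reasoning, answer).

    Returns None when the text does not follow the assumed template,
    which marks the rollout as invalid.
    """
    match = _TEMPLATE.match(text)
    if match is None:
        return None
    desc = match.group("desc").strip()
    reason = match.group("reason").strip()
    ans = match.group("ans").strip()
    if not (desc and reason and ans):
        # A stage that contains only whitespace also counts as invalid.
        return None
    return desc, reason, ans
```

A rollout for which `split_rollout` returns `None` would be routed to the reference teacher, while a successfully split rollout gives each stage to its own teacher.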
Topics
- Multimodal LLMs
- Self-Distillation
- Visual Grounding
- Shortcut Learning
- Perception-Reasoning Decoupling
- MLLM Post-training
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.