Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Summary
Vision-OPD (Vision On-Policy Distillation) is a regional-to-global self-distillation framework designed to improve the fine-grained visual understanding of Multimodal Large Language Models (MLLMs). MLLMs often miss details when given a full image, yet answer better when given a crop centered on the relevant evidence. This points to a "regional-to-global perception gap": the weakness lies in focusing on the right region of the full image rather than in recognizing what that region contains. Vision-OPD addresses this by training an MLLM to transfer its own stronger crop-conditioned perception to its full-image policy. The same MLLM serves as a crop-conditioned teacher and a full-image-conditioned student, and training minimizes the token-level divergence between their next-token distributions on on-policy rollouts. The model thereby learns the benefits of visual zooming without external teachers, ground-truth labels, reward verifiers, or inference-time tools. In the reported experiments, Vision-OPD models achieve competitive or superior performance on several fine-grained visual understanding benchmarks compared with larger open-source, closed-source, and agentic models.
Key takeaway
For AI engineers developing MLLMs that require precise visual understanding, Vision-OPD offers a self-distillation framework for improving fine-grained perception. Consider this regional-to-global approach to improve accuracy on detail-oriented tasks without external teachers, ground-truth labels, or reward verifiers, which could simplify your training workflow and reduce data annotation costs.
Key insights
Self-distillation from regional crops to full images improves MLLM fine-grained visual understanding.
Principles
- MLLMs have a regional-to-global perception gap.
- Focusing on evidence is key for fine-grained tasks.
Method
Vision-OPD uses the same MLLM as both a crop-conditioned teacher and a full-image-conditioned student. It minimizes the token-level divergence between their next-token distributions on on-policy rollouts, transferring regional perception to the full-image policy.
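The token-level objective can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes the divergence is a reverse KL, KL(student || teacher), averaged over response tokens; the summary only says "token-level divergence", so the direction is an assumption. The function names (`log_softmax`, `token_kl`, `distillation_loss`) and the toy logits are hypothetical. In a real setup, both sets of logits would come from the same MLLM scoring a response sampled from the full-image student, once conditioned on the full image and once on the evidence-centered crop, with the teacher side presumably treated as a fixed target.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed divergence choice)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(sl) * (sl - tl) for sl, tl in zip(s, t))

def distillation_loss(student_seq, teacher_seq):
    """Mean per-token divergence over the response tokens of one rollout.

    student_seq: logits from the full-image-conditioned pass, one list per token.
    teacher_seq: logits from the crop-conditioned pass for the same tokens.
    """
    assert len(student_seq) == len(teacher_seq)
    per_token = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)

# Toy rollout: 2 response tokens over a 3-word vocabulary (made-up numbers).
student_seq = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher_seq = [[0.1, 2.5, 0.1], [0.3, 0.3, 0.3]]

loss = distillation_loss(student_seq, teacher_seq)   # positive: the two views disagree on token 1
zero = distillation_loss(student_seq, student_seq)   # identical distributions give zero loss
```

The loss is zero only when the full-image distributions already match the crop-conditioned ones at every token, which is the sense in which minimizing it closes the regional-to-global gap.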
In practice
- Improve MLLM detail recognition.
- Reduce reliance on external teachers and ground-truth labels.
Topics
- Multimodal Large Language Models
- Fine-grained Visual Understanding
- Regional-to-Global Perception Gap
- Vision-OPD
- On-Policy Self-Distillation
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.