Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

Vision-OPD (Vision On-Policy Distillation) is a regional-to-global self-distillation framework for improving the fine-grained visual understanding of Multimodal Large Language Models (MLLMs). MLLMs often miss fine details when given a full image, yet answer more accurately when shown a crop centered on the relevant evidence. This "regional-to-global perception gap" suggests that the bottleneck is where the model focuses within the full image, not its ability to recognize local content. Vision-OPD addresses the gap by training an MLLM to transfer its own stronger crop-conditioned perception to its full-image policy. The same MLLM serves as both a crop-conditioned teacher and a full-image-conditioned student, and training minimizes the token-level divergence between their next-token distributions on on-policy rollouts. The model thereby learns the benefits of visual zooming without external teachers, ground-truth labels, reward verifiers, or inference-time tools. In the reported experiments, Vision-OPD models match or outperform larger open-source, closed-source, and agentic models on several fine-grained visual understanding benchmarks.

Key takeaway

For AI engineers building MLLMs that need precise visual understanding, Vision-OPD offers a self-distillation framework for improving fine-grained perception. Consider this regional-to-global approach for detail-oriented tasks: because it requires no external teacher and no ground-truth labels, it may improve accuracy while reducing data annotation costs.

Key insights

Self-distillation from regional crops to full images improves MLLM fine-grained visual understanding.

Method

Vision-OPD uses the same MLLM as both a crop-conditioned teacher and a full-image student, minimizing the token-level divergence between them during on-policy rollouts to transfer regional perception to the full-image policy.
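
A minimal sketch of the token-level objective described above, written in plain Python so it runs without a deep-learning framework. It assumes the divergence is the reverse KL, KL(student || teacher), which this summary does not specify. The function names (log_softmax, token_kl, distillation_loss) and the toy logits are illustrative and not taken from the paper. In actual training, both sets of logits would come from the same MLLM scoring a response sampled by the full-image student: once conditioned on the full image (student) and once conditioned on the evidence-centered crop (teacher).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is used here for illustration; the summary only
    says "token-level divergence".
    """
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_p_s, log_p_t))


def distillation_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token divergence over the tokens of one sampled response."""
    if len(student_logits_seq) != len(teacher_logits_seq):
        raise ValueError("student and teacher must score the same tokens")
    per_token = [
        token_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)


# Toy rollout: 3 response tokens, vocabulary of 4 (hypothetical numbers).
# student_seq: logits given the full image; teacher_seq: logits given the crop.
student_seq = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, -2.0, 0.3]]
teacher_seq = [[2.0, 0.5, 0.1, -1.0], [3.0, 0.0, 0.0, 0.0], [1.0, 1.0, -2.0, 0.3]]

loss = distillation_loss(student_seq, teacher_seq)
print(f"mean token-level divergence: {loss:.4f}")
```

In this toy rollout the two views disagree only at the second token, where the crop-conditioned teacher is confident and the full-image student is uniform, so that position alone contributes to the loss. Whether the teacher branch receives gradients and how crops are selected are not stated in this summary, so the sketch leaves both out.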

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.