ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation
Summary
ViCuR is a visually grounded privileged-teacher distillation framework that addresses the train-test mismatch that answer-side privileged teachers cause in multimodal on-policy distillation (OPD). Existing methods give the teacher training-time-only signals such as reference answers, which encourages shortcut imitation rather than visually grounded reasoning. ViCuR instead gives the teacher "visual cues": query-related evidence from the input, so the source of the privilege remains accessible at inference. A lightweight "cue recovery module" uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence internally, without altering the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks, ViCuR improves average performance over answer-based on-policy self-distillation by +1.19 (Qwen3-VL-2B) and +1.24 (Qwen3-VL-8B). It also extends to stronger-teacher OPD, where it outperforms baselines by +0.64 and +1.08, with notable out-of-domain gains at the 8B scale.
Key takeaway
Machine learning engineers who train multimodal large language models with on-policy distillation should prefer visually grounded privilege over answer-based signals. Answer-side privilege can induce a train-test mismatch that leads to shortcut learning. Recoverable visual cues, paired with a cue recovery module such as ViCuR's sink-token cross-attention, foster genuine visual reasoning instead: in the reported self-distillation results, average performance across seven benchmarks improves by +1.19 (Qwen3-VL-2B) and +1.24 (Qwen3-VL-8B), and the stronger-teacher setting shows notable out-of-domain gains at the 8B scale.
Key insights
Visually grounded cues, recoverable by the student, enhance multimodal reasoning distillation more effectively than answer-side privilege.
Principles
- Teacher privilege design impacts distillation as much as teacher strength.
- Privilege should align with the student's inference-time access.
- Answer-side privilege can induce train-test mismatch.
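To make the setting behind these principles concrete, here is a minimal sketch of a per-token on-policy distillation loss. It assumes the common reverse-KL formulation, in which the student samples a trajectory and the privileged teacher scores the same positions; this summary does not state ViCuR's exact objective, and the function names `log_softmax`, `reverse_kl`, and `opd_loss` are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single position of the rollout."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_s, log_t))


def opd_loss(student_steps, teacher_steps):
    """Mean reverse KL over positions of a trajectory sampled by the student.

    Assumption: teacher_steps are the teacher's logits at the same positions,
    computed with its extra privileged context (a reference answer in
    answer-side setups, a visual cue in a cue-based setup).
    """
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)


# Toy vocabulary of size 3, trajectory of length 2.
student = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
teacher = [[2.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
loss = opd_loss(student, teacher)
```

In this toy rollout the second position contributes nothing because student and teacher agree there; only the first position, where the privileged teacher is more confident, produces a training signal.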
Method
ViCuR uses a sink-token cross-attention module during prefill to aggregate task-relevant visual evidence into an internal representation, replacing answer-side privilege with visual cues.
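The aggregation step can be illustrated with a toy single-head cross-attention in plain Python. This is a sketch under stated assumptions, not ViCuR's implementation: learned query/key/value projections are omitted, the sink query is a fixed vector rather than a learned, question-conditioned one, and `sink_cross_attention` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sink_cross_attention(sink_queries, visual_tokens):
    """Each sink query attends over all visual tokens.

    Returns one pooled vector per sink token (an attention-weighted
    average of the visual tokens) together with the attention weights.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled, weights = [], []
    for q in sink_queries:
        w = softmax([dot(q, v) * scale for v in visual_tokens])
        pooled.append(
            [sum(w_i * v[j] for w_i, v in zip(w, visual_tokens)) for j in range(dim)]
        )
        weights.append(w)
    return pooled, weights


# Three 2-d visual tokens; one sink query aligned with the first feature axis.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sink_queries = [[4.0, 0.0]]
pooled, weights = sink_cross_attention(sink_queries, visual_tokens)
```

The sink token ends up weighting the first and third visual tokens more heavily than the second, so its pooled vector acts as a compact internal summary of the evidence the query points at. Because the pooling happens inside the forward pass, the model's external inputs and outputs stay unchanged.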
In practice
- Replace answer-based privilege with visual cues.
- Implement sink-token cross-attention for cue recovery.
- Focus on visually grounded evidence for multimodal tasks.
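As a rough illustration of the first bullet, the sketch below contrasts how teacher and student inputs might be assembled under the two kinds of privilege. Everything here is hypothetical: the `Sample` fields, the `<image>` placeholder, and the prompt wording are assumptions for illustration, not the format used by ViCuR or Qwen3-VL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    image_path: str
    question: str
    reference_answer: Optional[str] = None  # exists only at training time
    visual_cue: Optional[str] = None        # evidence taken from the input itself


def student_prompt(sample: Sample) -> str:
    """The student sees the same input at training and at inference."""
    return f"<image>\n{sample.question}"


def teacher_prompt(sample: Sample, privilege: str = "cue") -> str:
    """The teacher sees the student input plus one privileged hint."""
    if privilege == "answer":
        hint = f"Reference answer: {sample.reference_answer}"
    elif privilege == "cue":
        hint = f"Visual cue: {sample.visual_cue}"
    else:
        raise ValueError(f"unknown privilege type: {privilege}")
    return f"{student_prompt(sample)}\n{hint}"


sample = Sample(
    image_path="cats.png",
    question="How many cats are on the sofa?",
    reference_answer="3",
    visual_cue="three cats sit on the left side of the sofa",
)
cue_teacher_input = teacher_prompt(sample, "cue")
answer_teacher_input = teacher_prompt(sample, "answer")
```

The design point is that the cue hint describes something the student could in principle recover from the image at inference, whereas the reference answer has no inference-time counterpart.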
Topics
- ViCuR
- On-Policy Distillation
- Multimodal Reasoning
- Visual Cues
- Sink-Token Attention
- Qwen3-VL
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.