ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

ViCuR is a visually grounded privileged-teacher distillation framework that addresses the train-test mismatch introduced by answer-side privileged teachers in multimodal on-policy distillation (OPD). Existing methods give the teacher training-time-only signals such as reference answers, which encourages shortcut imitation rather than visually grounded reasoning. ViCuR replaces these signals with "visual cues": query-related evidence drawn from the input, whose source remains accessible at inference. A lightweight "cue recovery module" uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence internally, without altering the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks, ViCuR consistently improves average performance over answer-based on-policy self-distillation, by +1.19 (Qwen3-VL-2B) and +1.24 (Qwen3-VL-8B). It also extends to stronger-teacher OPD, outperforming baselines by +0.64 and +1.08, with notable out-of-domain gains at the 8B scale.

Key takeaway

For machine learning engineers applying on-policy distillation to multimodal large language models: prefer visually grounded privilege over answer-based signals. Answer-side privilege can induce a train-test mismatch that leads to shortcut learning. Recoverable visual cues, paired with a cue recovery module such as ViCuR's sink-token cross-attention, instead encourage genuine visual reasoning. The reported gains over answer-based self-distillation are +1.19 (Qwen3-VL-2B) and +1.24 (Qwen3-VL-8B) in average performance across seven benchmarks, and the stronger-teacher setting shows notable out-of-domain gains at the 8B scale.

Key insights

Visually grounded cues, recoverable by the student, enhance multimodal reasoning distillation more effectively than answer-side privilege.
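To make the distillation signal concrete, here is a minimal sketch of a per-token on-policy distillation loss. It assumes a reverse KL objective, KL(student || teacher), averaged over a response sampled from the student; this is a common choice for on-policy distillation, but the summary above does not state the objective ViCuR actually uses. The function names (`log_softmax`, `reverse_kl`, `opd_loss`) are hypothetical. The only difference between the answer-based baseline and a cue-based teacher in this sketch is what extra context the teacher logits were conditioned on.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position (assumed objective)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL over one student-sampled response.

    Both models score the SAME student-generated tokens. The teacher's
    context additionally holds the privileged information: a reference
    answer in the baseline, visual cues in a ViCuR-style setup.
    """
    assert len(student_logits_seq) == len(teacher_logits_seq)
    per_token = [reverse_kl(s, t)
                 for s, t in zip(student_logits_seq, teacher_logits_seq)]
    return sum(per_token) / len(per_token)

# Toy example: two token positions over a vocabulary of three tokens.
student = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
teacher = [[0.0, 2.0, 0.0], [1.0, 2.0, 3.0]]
loss = opd_loss(student, teacher)
```

The loss is zero only where student and teacher agree, so in this toy example the whole penalty comes from the first position.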

Method

ViCuR replaces answer-side privilege with visual cues and uses a sink-token cross-attention module during prefill to aggregate task-relevant visual evidence into an internal representation.
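The aggregation step can be illustrated with a minimal single-head sketch in which one sink-token query attends over the visual token features and pools them into a single vector. This is an assumption-laden illustration, not the paper's architecture: the summary does not specify the number of sink tokens, the heads, or the projections, so learned query/key/value projections are omitted and the name `sink_cross_attention` is hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length float lists."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sink_cross_attention(sink_query, visual_tokens):
    """Pool visual token features into one vector via a sink-token query.

    sink_query:    list[float] of length d (hypothetical sink-token state)
    visual_tokens: list of list[float], each of length d (image-token features)
    Returns (pooled, weights). Learned Q/K/V projections are omitted (assumption).
    """
    d = len(sink_query)
    # Scaled dot-product scores between the sink query and each visual token.
    scores = [dot(sink_query, tok) / math.sqrt(d) for tok in visual_tokens]
    weights = softmax(scores)
    # Attention-weighted sum of the visual tokens, one value per feature dim.
    pooled = [sum(w * tok[i] for w, tok in zip(weights, visual_tokens))
              for i in range(d)]
    return pooled, weights

# Toy example: three visual tokens with 2-dimensional features.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sink_query = [2.0, 0.0]
pooled, weights = sink_cross_attention(sink_query, visual_tokens)
```

In this toy input the query aligns with the first feature dimension, so the first and third tokens receive equal weight and the second receives less. In a real model the pooled vector would be produced once during prefill and consumed internally, which is consistent with the claim that the inference interface stays unchanged.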

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.