ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

ViCuR is a visually grounded privileged-teacher distillation framework that addresses a train-test mismatch in multimodal on-policy distillation (OPD). Existing methods give the teacher answer-side privilege, meaning training-time-only signals such as reference answers, which encourages shortcut imitation rather than visually grounded reasoning. ViCuR replaces this with "visual cues": query-related evidence drawn from the input, whose source remains accessible at inference. A lightweight "cue recovery module" uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence internally, without altering the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks, ViCuR consistently improves average performance over answer-based on-policy self-distillation by +1.19 (Qwen3-VL-2B) and +1.24 (Qwen3-VL-8B). It also extends to stronger-teacher OPD, outperforming baselines by +0.64 and +1.08, with notable out-of-domain gains at the 8B scale.

Key takeaway

Machine learning engineers who use on-policy distillation to train multimodal large language models should prefer visually grounded privilege over answer-based signals. Answer-side privilege gives the teacher information the student never sees at inference, which creates a train-test mismatch and invites shortcut learning. Recoverable visual cues, paired with a cue recovery module such as ViCuR's sink-token cross-attention, push the student toward genuine visual reasoning instead. In the reported results this yields average gains of +1.19 (Qwen3-VL-2B) and +1.24 (Qwen3-VL-8B) over answer-based self-distillation across seven benchmarks, and the stronger-teacher setting shows notable out-of-domain gains at the 8B scale.

Key insights

Visually grounded cues, recoverable by the student, enhance multimodal reasoning distillation more effectively than answer-side privilege.

Principles

Teacher privilege design impacts distillation as much as teacher strength.
Privilege should align with what the student can access at inference time.
Answer-side privilege can induce train-test mismatch.
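
To make the distillation setup behind these principles concrete, here is a minimal sketch of a per-token on-policy distillation loss. It assumes reverse KL between the student and a privileged teacher, scored on a response the student sampled itself. The source does not state which divergence ViCuR uses, and the names log_softmax, reverse_kl, and opd_loss are illustrative, not taken from the paper or its repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed divergence)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL over a student-sampled response.

    Both sequences hold one logit vector per generated token. The teacher
    scores the same tokens but conditions on extra privileged context.
    """
    if len(student_logits_seq) != len(teacher_logits_seq):
        raise ValueError("student and teacher must score the same tokens")
    per_token = [
        reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy 3-token vocabulary, 2 generated tokens.
    student = [[0.2, 1.0, -0.5], [0.0, 0.0, 0.0]]
    teacher = [[0.1, 2.0, -1.0], [0.5, -0.5, 0.0]]
    print(opd_loss(student, teacher))
```

The loss is zero only when the student already matches the teacher at every position, so whatever the teacher's privileged context contributes is what the student is pushed to reproduce. That is why the type of privilege matters as much as the strength of the teacher.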

Method

ViCuR uses a sink-token cross-attention module during prefill to aggregate task-relevant visual evidence into an internal representation, replacing answer-side privilege with visual cues.
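
The aggregation step can be illustrated with a small single-head cross-attention sketch. It assumes the sink tokens act as queries and the visual tokens supply keys and values. The summary does not spell out the wiring, so learned projections, multiple heads, and any conditioning on the text query are omitted, and the function name sink_cross_attention is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sink_cross_attention(sink_queries, visual_keys, visual_values):
    """Pool visual tokens into one vector per sink token.

    Assumed layout: each sink query attends over all visual keys with
    scaled dot-product attention and returns the attention-weighted sum
    of the visual values.
    """
    scale = 1.0 / math.sqrt(len(visual_keys[0]))
    value_dim = len(visual_values[0])
    pooled, weights = [], []
    for q in sink_queries:
        attn = softmax([dot(q, k) * scale for k in visual_keys])
        pooled.append(
            [sum(a * v[j] for a, v in zip(attn, visual_values))
             for j in range(value_dim)]
        )
        weights.append(attn)
    return pooled, weights


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]     # two visual tokens
    values = [[1.0, 0.0], [0.0, 1.0]]
    sinks = [[10.0, 0.0]]               # one sink query aligned with token 0
    pooled, weights = sink_cross_attention(sinks, keys, values)
    print(pooled, weights)
```

Because the pooled sink states are computed from the image tokens alone, this step can run once during prefill and leaves the model's input and output interface unchanged, which matches the property the summary attributes to the cue recovery module.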

In practice

Replace answer-based privilege with visual cues.
Implement sink-token cross-attention for cue recovery.
Focus on visually grounded evidence for multimodal tasks.
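
The first recommendation can be sketched as a difference in how the teacher's context is built. This example is entirely hypothetical: the Sample fields, the prompt templates, the "<image>" placeholder, and the idea that a cue is a short text string are all assumptions made for illustration, not details from the paper or the tiankanghui/ViCuR repository.

```python
from dataclasses import dataclass

IMAGE_SLOT = "<image>"  # hypothetical placeholder for the visual tokens


@dataclass
class Sample:
    question: str
    reference_answer: str  # exists only at training time
    visual_cue: str        # evidence drawn from the image (text form assumed)


def student_prompt(sample: Sample) -> str:
    """What the student sees in training and at inference: image and question."""
    return f"{IMAGE_SLOT}\nQuestion: {sample.question}"


def teacher_prompt(sample: Sample, privilege: str = "visual_cue") -> str:
    """Student prompt plus one kind of privileged context for the teacher."""
    base = student_prompt(sample)
    if privilege == "visual_cue":
        # The cue points at evidence the student could recover from the image.
        return f"{base}\nRelevant visual evidence: {sample.visual_cue}"
    if privilege == "answer":
        # Answer-side privilege: a signal the student never has at inference.
        return f"{base}\nReference answer: {sample.reference_answer}"
    raise ValueError(f"unknown privilege type: {privilege}")


if __name__ == "__main__":
    sample = Sample(
        question="What distance does the sign show?",
        reference_answer="42 km",
        visual_cue="the road sign in the upper left corner",
    )
    print(teacher_prompt(sample))
    print(teacher_prompt(sample, privilege="answer"))
```

In the cue variant the teacher's extra context describes something already present in the student's input, so the gap between teacher and student is one the student can close by attending to the image more carefully. In the answer variant the gap can only be closed by guessing.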

Topics

ViCuR
On-Policy Distillation
Multimodal Reasoning
Visual Cues
Sink-Token Attention
Qwen3-VL

Code references

tiankanghui/ViCuR

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.