MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models
Summary
MultiToP is a multimodal-context-aware visual token patching framework for mitigating hallucinations in Video Large Multimodal Models (VideoLMMs). It refines unreliable visual tokens before language generation using a lightweight Visual Token Patcher, which predicts token-level replacement distributions and substitutes misleading tokens with a dynamic global patch token. Training combines information-guided rank calibration, which leverages answer-conditioned frame-level cues from the VideoLMM backbone, with ground-truth answer supervision and sparsity regularization. MultiToP leaves the original VideoLMM unmodified and adds negligible inference overhead. In the reported experiments, it boosts Qwen3-VL-4B-Instruct's F1 score on Vript-HAL by 50.60% and improves Video-LLaVA-7B's ActivityNet-QA accuracy by 18.58%, while preserving general video understanding and remaining robust across model scales.
Key takeaway
For AI scientists and machine learning engineers deploying VideoLMMs, MultiToP offers a way to mitigate hallucinations without costly model retraining or significant inference overhead. Consider integrating this token-level visual evidence refinement to improve model reliability and factual consistency. The method improves performance on hallucination benchmarks such as Vript-HAL while preserving general video understanding, which suits applications that need both accuracy and efficiency. Note that the current fixed replacement ratio may need tuning across diverse video content.
Key insights
Hallucinations in VideoLMMs can be mitigated by selectively patching unreliable visual tokens before language generation.
Principles
- Repairing misleading visual tokens can mitigate hallucinations.
- VideoLMM-derived frame-level cues guide token replacement.
- Relative ranking objectives are robust to noisy attention.
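To illustrate why a relative ranking objective tolerates noisy attention, here is a minimal sketch of a pairwise hinge loss. It is not the paper's rank calibration loss; the function name `pairwise_rank_loss`, the `margin` value, and the hinge form are assumptions made for illustration. The loss only asks that less informative tokens receive higher replacement logits than more informative ones, so it depends on the order of the information scores, not on their magnitudes.

```python
import numpy as np

def pairwise_rank_loss(replace_logits, info_scores, margin=0.1):
    """Hypothetical pairwise hinge loss (illustrative, not the paper's loss).

    A token with a lower information score should get a higher
    replacement logit than a token with a higher information score.
    """
    r = np.asarray(replace_logits, dtype=float)
    s = np.asarray(info_scores, dtype=float)
    # less_informative[i, j] is True when token i is less informative than token j.
    less_informative = s[:, None] < s[None, :]
    if not less_informative.any():
        return 0.0
    # For those pairs we want r[i] - r[j] >= margin; penalize the shortfall.
    shortfall = np.maximum(0.0, margin - (r[:, None] - r[None, :]))
    return float(shortfall[less_informative].mean())

info = np.array([0.9, 0.1, 0.5])          # noisy attention-derived scores
good_logits = np.array([-2.0, 2.0, 0.0])  # least informative token is most replaceable
bad_logits = np.array([2.0, -2.0, 0.0])   # ordering reversed

loss_good = pairwise_rank_loss(good_logits, info)
loss_bad = pairwise_rank_loss(bad_logits, info)
# Rescaling or shifting the scores keeps their order, so the loss is unchanged.
loss_rescaled = pairwise_rank_loss(bad_logits, 10.0 * info + 3.0)
```

Because only the ordering of `info` enters the loss, any monotonic distortion of the attention scores leaves the training signal untouched, which is the robustness property the principle refers to.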
Method
MultiToP employs a Visual Token Patcher to predict replacement distributions and generate a dynamic global patch token. The patcher is trained with a cross-entropy loss on ground-truth answers, information-guided rank calibration, and sparsity regularization.
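The three training terms named above could be combined as a weighted sum. The additive form and the weights $\lambda_{\text{rank}}$ and $\lambda_{\text{sparse}}$ are assumptions for illustration; this summary does not state how the terms are weighted.

```latex
\mathcal{L}
  = \mathcal{L}_{\text{CE}}
  + \lambda_{\text{rank}} \, \mathcal{L}_{\text{rank}}
  + \lambda_{\text{sparse}} \, \mathcal{L}_{\text{sparse}}
```

Here $\mathcal{L}_{\text{CE}}$ stands for the cross-entropy on the ground-truth answer, $\mathcal{L}_{\text{rank}}$ for the information-guided rank calibration term, and $\mathcal{L}_{\text{sparse}}$ for the sparsity regularizer that discourages replacing too many tokens.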
In practice
- Refine visual tokens at the token level, before language generation.
- Derive information cues from answer-to-visual attention.
- Use Gumbel-Softmax for differentiable token replacement.
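The practices listed above can be sketched in NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: the helper names `frame_cues` and `patch_tokens`, the two-way keep/replace logits, and the use of the mean token as a stand-in for the dynamic global patch token are all hypothetical. NumPy has no automatic differentiation, so the code only shows the forward computation of the relaxation.

```python
import numpy as np

def frame_cues(attn, tokens_per_frame):
    """Aggregate answer-to-visual attention into one score per frame.

    attn has shape (answer_len, num_visual_tokens); num_visual_tokens
    must be a multiple of tokens_per_frame.
    """
    per_token = np.asarray(attn, dtype=float).mean(axis=0)
    return per_token.reshape(-1, tokens_per_frame).sum(axis=1)

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Softmax over logits perturbed by Gumbel(0, 1) noise, at temperature tau."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def patch_tokens(tokens, logits, tau=1.0, rng=None):
    """Softly replace tokens with a global patch token.

    tokens: (N, D) visual tokens; logits: (N, 2) as [keep, replace].
    The patch token is the mean token here, a hypothetical stand-in.
    """
    tokens = np.asarray(tokens, dtype=float)
    probs = gumbel_softmax(np.asarray(logits, dtype=float), tau, rng)
    p_replace = probs[:, 1:2]                   # shape (N, 1)
    patch = tokens.mean(axis=0, keepdims=True)  # shape (1, D)
    patched = (1.0 - p_replace) * tokens + p_replace * patch
    return patched, p_replace[:, 0]

# Two frames of three tokens each, with uniform attention from two answer tokens.
cues = frame_cues(np.full((2, 6), 1.0 / 6.0), tokens_per_frame=3)

tokens = np.arange(12, dtype=float).reshape(4, 3)
logits = np.array([[50.0, -50.0], [50.0, -50.0], [-50.0, 50.0], [-50.0, 50.0]])
patched, p_replace = patch_tokens(tokens, logits, rng=np.random.default_rng(0))
```

With confident logits the first two tokens pass through unchanged and the last two become the patch token. In a framework with automatic differentiation the same soft mixture lets gradients reach the patcher's logits; PyTorch, for example, provides this relaxation as `torch.nn.functional.gumbel_softmax`.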
Topics
- Video Large Multimodal Models
- Hallucination Mitigation
- Visual Token Patching
- Qwen3-VL-4B-Instruct
- Computational Efficiency
- Video Understanding
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.