MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models
Summary
MultiToP is a multimodal-context-aware visual token patching framework designed to mitigate hallucinations in Video Large Multimodal Models (Video-LMMs). It refines unreliable visual tokens before language generation by introducing a lightweight Visual Token Patcher. This patcher predicts token-level replacement distributions and selectively substitutes unreliable visual tokens with a dynamic global patch token. MultiToP employs information-guided rank calibration, using answer-conditioned frame-level information cues from the backbone, combined with ground-truth answer supervision and sparsity regularization for effective training. Experiments show MultiToP reduces hallucinations on Vript-HAL, improving Qwen3-VL-4B-Instruct F1 scores by 50.60% over the vanilla model with negligible inference overhead, while also preserving general video understanding, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.
Key takeaway
For AI Scientists and Machine Learning Engineers working on video understanding with Video-LMMs, MultiToP offers a practical approach to combat model hallucinations. This framework significantly improves factual consistency, demonstrated by a 50.60% F1 score increase for Qwen3-VL-4B-Instruct on Vript-HAL, without compromising general understanding or incurring substantial inference overhead. You should consider integrating token-patching techniques like MultiToP to enhance the reliability and trustworthiness of your Video-LMM deployments.
Key insights
MultiToP patches unreliable visual tokens in Video-LMMs to reduce hallucinations while preserving general understanding.
Principles
- Localized visual evidence refinement is key.
- Token-level replacement improves reliability.
- Information-guided calibration enhances patching.
Method
MultiToP uses a Visual Token Patcher to predict token replacement distributions, substituting unreliable visual tokens with a dynamic global patch token, guided by answer-conditioned frame-level information.
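The patching step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_token` and `patch_tokens`, the fixed `threshold`, and the use of a mean-pooled vector as the "dynamic global patch token" are all assumptions made for illustration. The replacement probabilities are taken as given; in MultiToP they would come from the learned Visual Token Patcher.

```python
from typing import List

Vector = List[float]


def mean_token(tokens: List[Vector]) -> Vector:
    """Element-wise mean of all tokens.

    Assumption: stands in for the dynamic global patch token, whose
    actual construction is not described in this summary.
    """
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def patch_tokens(
    tokens: List[Vector],
    replace_probs: List[float],
    threshold: float = 0.5,  # hypothetical cut-off, not from the source
) -> List[Vector]:
    """Replace tokens whose replacement probability exceeds the threshold."""
    if len(tokens) != len(replace_probs):
        raise ValueError("one replacement probability is needed per token")
    patch = mean_token(tokens)
    return [
        list(patch) if prob > threshold else list(token)
        for token, prob in zip(tokens, replace_probs)
    ]


# Three toy 2-d visual tokens; the third is flagged as unreliable.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
probs = [0.1, 0.2, 0.9]
patched = patch_tokens(tokens, probs)
```

Only the flagged token is overwritten; the others pass through unchanged, which mirrors the selective substitution the summary describes.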
In practice
- Apply token patching to the visual tokens fed into the Video-LMM, before language generation.
- Use answer-conditioned frame-level information cues from the backbone to calibrate which tokens get patched.
- Add sparsity regularization during training so that patching stays selective.
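As a rough sketch of how the three training signals named in the summary (ground-truth answer supervision, information-guided rank calibration, and sparsity regularization) might be combined, consider the following. The exact loss forms are not given in this summary, so everything here is an assumption: the pairwise hinge form of the ranking term, its direction (more informative frames should receive lower replacement probability), the mean-probability sparsity penalty, and the weights `lam_rank` and `lam_sparse` are hypothetical choices.

```python
import math
from typing import List


def answer_nll(answer_token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the ground-truth answer tokens."""
    return -sum(math.log(p) for p in answer_token_probs) / len(answer_token_probs)


def sparsity_penalty(replace_probs: List[float]) -> float:
    """Mean replacement probability; lower means fewer tokens get patched."""
    return sum(replace_probs) / len(replace_probs)


def rank_calibration(
    frame_info: List[float],
    frame_replace: List[float],
    margin: float = 0.1,  # hypothetical margin
) -> float:
    """Pairwise hinge loss (assumed form).

    For every pair where frame i is more informative than frame j,
    penalize cases where frame i is not replaced less than frame j
    by at least `margin`.
    """
    total, pairs = 0.0, 0
    n = len(frame_info)
    for i in range(n):
        for j in range(n):
            if frame_info[i] > frame_info[j]:
                total += max(0.0, margin + frame_replace[i] - frame_replace[j])
                pairs += 1
    return total / pairs if pairs else 0.0


def total_loss(
    answer_token_probs: List[float],
    frame_info: List[float],
    frame_replace: List[float],
    replace_probs: List[float],
    lam_rank: float = 1.0,    # hypothetical weight
    lam_sparse: float = 0.1,  # hypothetical weight
) -> float:
    """Weighted sum of the three terms."""
    return (
        answer_nll(answer_token_probs)
        + lam_rank * rank_calibration(frame_info, frame_replace)
        + lam_sparse * sparsity_penalty(replace_probs)
    )
```

In this sketch the sparsity weight trades off against the answer loss: a larger `lam_sparse` pushes the patcher toward leaving more tokens untouched.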
Topics
- Video Large Multimodal Models
- Hallucination Mitigation
- Visual Token Patching
- Multimodal AI
- Video Understanding
- Qwen3-VL-4B-Instruct
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.