MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing) · Depth: Expert, extended

Summary

MultiToP is a multimodal-context-aware visual token patching framework for mitigating hallucinations in Video Large Multimodal Models (VideoLMMs). It refines unreliable visual tokens before language generation using a lightweight Visual Token Patcher, which predicts token-level replacement distributions and substitutes misleading tokens with a dynamic global patch token. Training combines information-guided rank calibration, which draws on answer-conditioned frame-level cues from the VideoLMM backbone, with ground-truth answer supervision and sparsity regularization. MultiToP leaves the original VideoLMM unmodified and adds negligible inference overhead. Reported experiments show it raises Qwen3-VL-4B-Instruct's F1 score on Vript-HAL by 50.60% and Video-LLaVA-7B's ActivityNet-QA accuracy by 18.58%, while preserving general video understanding and remaining robust across model scales.

Key takeaway

For AI scientists and machine learning engineers deploying VideoLMMs, MultiToP offers a way to mitigate hallucinations without retraining the VideoLMM itself or adding significant inference overhead. Consider integrating this token-level visual evidence refinement to improve model reliability and factual consistency. The method improves performance on hallucination benchmarks such as Vript-HAL while preserving general video understanding, which suits applications that need both accuracy and efficiency. Note that the current fixed replacement ratio may need tuning to perform well across diverse video content.

Key insights

Hallucinations in VideoLMMs can be mitigated by selectively patching unreliable visual tokens before language generation.
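As a rough illustration of what "patching" can mean here, the sketch below replaces the least reliable visual tokens with one shared patch token before they reach the language model. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the softmax over per-token scores, the default `replace_ratio`, and the choice of the mean of the kept tokens as the global patch token are all hypothetical.

```python
import math


def softmax(scores):
    """Turn unnormalized scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_tokens(tokens, replace_scores, replace_ratio=0.25):
    """Replace the highest-scoring tokens with one shared patch token.

    tokens:         list of equal-length float lists (visual token embeddings)
    replace_scores: one unnormalized score per token; higher means the token
                    is judged more likely to mislead (assumed patcher output)
    replace_ratio:  fixed fraction of tokens to replace (hypothetical default)
    Returns (patched_tokens, sorted indices of replaced tokens).
    """
    if len(tokens) != len(replace_scores):
        raise ValueError("need exactly one score per token")
    n = len(tokens)
    if n == 0:
        return [], []

    # Always keep at least one token so the patch token can be computed.
    k = min(int(n * replace_ratio), n - 1)

    # Assumed form of the token-level replacement distribution.
    probs = softmax(replace_scores)
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    replaced = set(order[:k])
    kept = [i for i in range(n) if i not in replaced]

    # Assumption: the global patch token is the mean of the kept tokens.
    dim = len(tokens[0])
    patch = [sum(tokens[i][d] for i in kept) / len(kept) for d in range(dim)]

    patched = [list(patch) if i in replaced else list(tokens[i]) for i in range(n)]
    return patched, sorted(replaced)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
    scores = [0.1, 0.2, 3.0, 0.0]  # token 2 is flagged as least reliable
    patched, replaced = patch_tokens(tokens, scores, replace_ratio=0.25)
    print(replaced)
    print(patched[2])
```

Because the backbone only ever sees the patched token sequence, a scheme like this leaves the VideoLMM's weights untouched, which is consistent with the summary's claim that the original model is not modified.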

Method

MultiToP employs a Visual Token Patcher to predict token-level replacement distributions and generate a dynamic global patch token. The patcher is optimized with three signals: a cross-entropy loss against ground-truth answers, information-guided rank calibration, and sparsity regularization.
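One plausible reading of the three training signals is a weighted sum, sketched below. The weighted-sum form, the weights \lambda_{\mathrm{rank}} and \lambda_{\mathrm{sp}}, and the symbol names are assumptions for illustration; this summary does not state how the terms are combined or defined.

```latex
\mathcal{L}
  = \mathcal{L}_{\mathrm{CE}}
  + \lambda_{\mathrm{rank}} \, \mathcal{L}_{\mathrm{rank}}
  + \lambda_{\mathrm{sp}} \, \mathcal{L}_{\mathrm{sp}}
```

Here \mathcal{L}_{\mathrm{CE}} stands for the cross-entropy against ground-truth answers, \mathcal{L}_{\mathrm{rank}} for the information-guided rank calibration term, and \mathcal{L}_{\mathrm{sp}} for the sparsity regularizer on how many tokens are replaced.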

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.