MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, Natural Language Processing · Depth: Expert, quick

Summary

MultiToP is a multimodal-context-aware visual token patching framework that mitigates hallucinations in Video Large Multimodal Models (Video-LMMs) by refining unreliable visual tokens before language generation. A lightweight Visual Token Patcher predicts token-level replacement distributions and selectively substitutes unreliable visual tokens with a dynamic global patch token. Training uses information-guided rank calibration, driven by answer-conditioned frame-level information cues from the backbone, together with ground-truth answer supervision and sparsity regularization. In the reported experiments, MultiToP reduces hallucinations on Vript-HAL, improving the F1 score of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model with negligible inference overhead. It also preserves general video understanding, with an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.
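
The three training signals named in the summary (rank calibration, answer supervision, sparsity regularization) could be combined roughly as in the sketch below. This is an illustration under stated assumptions, not the paper's objective: the pairwise hinge form, the direction of the ranking (less informative frames should be replaced more readily), the use of one replacement probability per frame, and the names `pairwise_rank_loss`, `sparsity_penalty`, `total_loss`, `lam_rank`, and `lam_sparse` are all hypothetical.

```python
def pairwise_rank_loss(replace_probs, info_scores, margin=0.1):
    """Hinge-style ranking loss over frame pairs (assumed form).

    Assumption: a frame with a LOWER information score should receive a
    HIGHER replacement probability, by at least `margin`.
    """
    total, pairs = 0.0, 0
    n = len(info_scores)
    for i in range(n):
        for j in range(n):
            if info_scores[i] < info_scores[j]:
                # Frame i is less informative: we want p_i >= p_j + margin.
                gap = replace_probs[i] - replace_probs[j]
                total += max(0.0, margin - gap)
                pairs += 1
    return total / pairs if pairs else 0.0


def sparsity_penalty(replace_probs):
    """Mean replacement probability; penalizes patching too many tokens."""
    return sum(replace_probs) / len(replace_probs)


def total_loss(answer_nll, replace_probs, info_scores,
               lam_rank=1.0, lam_sparse=0.1):
    """Weighted sum of the three terms (weights are hypothetical).

    `answer_nll` stands in for the ground-truth answer loss computed
    by the Video-LMM backbone; it is treated as a given number here.
    """
    return (answer_nll
            + lam_rank * pairwise_rank_loss(replace_probs, info_scores)
            + lam_sparse * sparsity_penalty(replace_probs))
```

In this sketch, a patcher that assigns high replacement probability to the low-information frame incurs no ranking penalty, while the sparsity term still discourages replacing more than necessary.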

Key takeaway

For AI scientists and machine learning engineers working on video understanding with Video-LMMs, MultiToP offers a practical way to reduce model hallucinations. It improves factual consistency, with a reported 50.60% F1 increase for Qwen3-VL-4B-Instruct on Vript-HAL, without compromising general video understanding or adding substantial inference overhead. Consider token-patching techniques like MultiToP when reliability and trustworthiness matter in your Video-LMM deployments.

Key insights

MultiToP patches unreliable visual tokens in Video-LMMs to reduce hallucinations while preserving general understanding.

Principles

Method

MultiToP uses a Visual Token Patcher to predict token replacement distributions, substituting unreliable visual tokens with a dynamic global patch token, guided by answer-conditioned frame-level information.
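
The substitution step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the source does not say how the global patch token is computed or how the replacement decision is made, so the mean-pooled patch token, the sigmoid, the hard `threshold`, and the names `mean_token` and `patch_tokens` are assumptions. The `logits` argument stands in for the output of the Visual Token Patcher.

```python
import math


def sigmoid(x):
    """Map a raw patcher score to a replacement probability."""
    return 1.0 / (1.0 + math.exp(-x))


def mean_token(tokens):
    """Average all token vectors.

    Assumption: mean pooling is a stand-in for the 'dynamic global
    patch token'; the actual construction is not given in the source.
    """
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def patch_tokens(tokens, logits, threshold=0.5):
    """Replace tokens whose replacement probability exceeds `threshold`.

    tokens: list of visual token vectors (lists of floats)
    logits: one hypothetical patcher score per token
    Returns the patched tokens and a boolean mask of replaced positions.
    """
    patch = mean_token(tokens)
    mask = [sigmoid(z) > threshold for z in logits]
    patched = [list(patch) if replaced else list(tok)
               for tok, replaced in zip(tokens, mask)]
    return patched, mask


# Usage: only the second token is judged unreliable and gets patched.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
patched, mask = patch_tokens(tokens, logits=[-3.0, 4.0, -1.0])
```

Because the sequence length is unchanged, a step like this could sit between the vision encoder and the language model without altering the rest of the pipeline, which is consistent with the negligible inference overhead reported in the summary.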

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.