Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Spatiotemporal-Semantic Contrastive Decoding (SSCD) is a decoding strategy for mitigating hallucinations in Video Large Language Models (VideoLLMs) such as Video-LLaVA (8 frames) and LLaVA-NeXT-Video (16 frames), both with 7B parameters. Hallucinations are outputs that are inconsistent with the explicit video content or with factual evidence, and they undermine model reliability. SSCD addresses them by constructing "negative" video features that deliberately disrupt spatiotemporal consistency and semantic alignment. At inference time, these disrupted features are contrasted against the original features during decoding to suppress hallucinated outputs, without retraining the backbone model. Experiments on the VideoHallucer, EventHallusion, and VideoHallu benchmarks show that SSCD reduces hallucinations while preserving general video understanding and reasoning capabilities on ActivityNet-QA and MMVU.

Key takeaway

For AI Scientists and ML Engineers deploying VideoLLMs, SSCD offers a way to reduce hallucinations at inference time without costly retraining of the backbone model. Consider integrating this decoding strategy to improve consistency with video content. Calibrate the contrastive strength (alpha, e.g., 0.8 for Video-LLaVA, 0.4 for LLaVA-NeXT-Video) and the plausibility constraint (beta, e.g., 0.1) to balance hallucination mitigation against preserving general video understanding and reasoning.
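To make the roles of alpha and beta concrete, here is a minimal sketch of a single contrastive decoding step. It assumes the formulation commonly used in contrastive decoding work (amplify the log-probabilities from the original video features, subtract those from the disrupted features, and keep only tokens the original model already finds plausible); the summary does not state SSCD's exact scoring rule, so this may differ from the paper. The function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_next_token(pos_logits, neg_logits, alpha=0.8, beta=0.1):
    """Greedy next-token choice under an assumed contrastive scoring rule.

    pos_logits: logits conditioned on the original video features.
    neg_logits: logits conditioned on the disrupted ("negative") features.
    alpha:      contrastive strength (alpha = 0 recovers plain greedy decoding).
    beta:       plausibility constraint, 0 < beta <= 1.
    """
    pos_logp = log_softmax(pos_logits)
    neg_logp = log_softmax(neg_logits)

    # Plausibility constraint: keep tokens with p_pos >= beta * max(p_pos).
    cutoff = math.log(beta) + max(pos_logp)

    best_token, best_score = None, -math.inf
    for token, (lp, ln) in enumerate(zip(pos_logp, neg_logp)):
        if lp < cutoff:
            continue  # implausible under the original features, skip
        # Assumed scoring rule: (1 + alpha) * log p_pos - alpha * log p_neg.
        score = (1 + alpha) * lp - alpha * ln
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy vocabulary of three tokens (hypothetical numbers).
# Token 0 is favored even more when the video is disrupted, so it is likely
# driven by priors rather than by the video; token 1 depends on the video.
pos = [2.0, 1.9, -3.0]
neg = [2.5, 0.5, -15.0]
print(contrastive_next_token(pos, neg, alpha=0.8, beta=0.1))  # 1
print(contrastive_next_token(pos, neg, alpha=0.0, beta=0.1))  # 0
```

In this toy case, raising alpha shifts the choice from token 0 to token 1. The beta cutoff keeps the contrast from promoting token 2, which is very unlikely under the original features but would otherwise score highest because it is even less likely under the disrupted ones.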

Key insights

SSCD mitigates VideoLLM hallucinations by using deliberately disrupted spatiotemporal and semantic negative features in contrastive decoding.

Principles

Method

A lightweight Spatiotemporal-Semantic Disruptor perturbs video features in two ways: it weakens cross-temporal and cross-spatial associations via random walks, and it minimizes conditional mutual information with the target text. The resulting negative features are then contrasted against the original features during decoding.
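As a rough illustration of the random-walk idea only, the sketch below treats video features as a frames x height x width grid and replaces each position's feature with the one found at the end of a short random walk over neighboring positions. This is one possible reading of "weakening associations via random walks", not the paper's disruptor: the summary does not describe the actual mechanism, the function name and parameters are hypothetical, and the conditional mutual information objective is not modeled here.

```python
import random


def random_walk_disrupt(features, steps=3, seed=0):
    """Return a grid where each slot holds the feature reached by a random walk.

    features: nested list indexed as features[t][h][w], where each entry is a
              feature vector (any Python object works for this sketch).
    steps:    walk length; steps = 0 returns a copy of the original layout.
    seed:     seed for a private random generator, for reproducibility.
    """
    rng = random.Random(seed)
    T, H, W = len(features), len(features[0]), len(features[0][0])
    # One step moves to an adjacent frame, row, or column.
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

    def clamp(value, size):
        # Stay inside the grid; a step past the border leaves the index as is.
        return max(0, min(size - 1, value))

    disrupted = []
    for t in range(T):
        frame = []
        for h in range(H):
            row = []
            for w in range(W):
                ct, ch, cw = t, h, w
                for _ in range(steps):
                    dt, dh, dw = rng.choice(moves)
                    ct, ch, cw = clamp(ct + dt, T), clamp(ch + dh, H), clamp(cw + dw, W)
                row.append(features[ct][ch][cw])
            frame.append(row)
        disrupted.append(frame)
    return disrupted


# Toy grid: 4 frames of 2 x 3 positions, each "feature" is its own coordinate
# so the displacement introduced by the walk is easy to inspect.
grid = [[[(t, h, w) for w in range(3)] for h in range(2)] for t in range(4)]
negative = random_walk_disrupt(grid, steps=3, seed=0)
```

Longer walks move features further from their original time and location, which in this sketch is what loosens the spatiotemporal structure of the negative features.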

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.