MACD: Model-Aware Contrastive Decoding via Counterfactual Data

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Model-aware Contrastive Decoding (MACD) is an inference-time strategy for reducing hallucination in Video Large Language Models (Video-LLMs). Video-LLMs, such as those in the Qwen and InternVL families, often generate ungrounded content when visual evidence is weak or ambiguous. Traditional contrastive decoding (CD) methods rely on random perturbations of the input. MACD instead uses the Video-LLM's own feedback to identify the object regions most responsible for hallucination, generates targeted object-level counterfactual inputs from them, and integrates these into the CD process to enforce evidence-grounded token selection. In experiments on the EventHallusion, MVBench, Perception-test, and Video-MME benchmarks, MACD consistently reduces hallucination while maintaining or improving task accuracy, and it is particularly effective in challenging scenarios involving small, occluded, or co-occurring objects.

Key takeaway

If you are an AI scientist or ML engineer deploying Video-LLMs and running into hallucination, MACD is an inference-time option worth evaluating. It improves output grounding without retraining the model, and it is reported to help most where visual evidence is ambiguous or object interactions are complex. Because the method updates its masks using gradients of the model's prediction loss, it needs gradient access to the model rather than a black-box API.

Key insights

MACD uses the model's own feedback to create targeted counterfactual data, which makes contrastive decoding more effective against Video-LLM hallucination.

Principles

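Video-LLMs tend to hallucinate when visual evidence is weak or ambiguous.
Contrastive decoding (CD) is only as effective as its perturbed inputs are relevant to the hallucination.
Feedback from the model itself yields more targeted counterfactual data than random perturbation.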
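
To make the second principle concrete, here is a minimal sketch of a contrastive decoding step. The summary does not give MACD's exact formula, so the code assumes a commonly used formulation: the contrasted logits are `(1 + alpha) * original - alpha * counterfactual`, restricted to tokens that are plausible under the original input. The function names, the three-token vocabulary, and all logit values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_decode(orig_logits, cf_logits, alpha=1.0, beta=0.1):
    """Assumed CD rule (not confirmed as MACD's): amplify what the original
    input supports and subtract what the counterfactual input still supports."""
    p_orig = softmax(orig_logits)
    cutoff = beta * max(p_orig)
    # Plausibility constraint: drop tokens that are unlikely under the original input.
    adjusted = [
        (1 + alpha) * o - alpha * c if p >= cutoff else float("-inf")
        for o, c, p in zip(orig_logits, cf_logits, p_orig)
    ]
    return softmax(adjusted)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical vocabulary and logits; "dog" is the ungrounded (hallucinated) token.
vocab = ["dog", "cat", "ball"]
orig_logits = [2.0, 1.8, 0.2]

# Uninformative perturbation: the counterfactual logits equal the originals,
# so the contrast cancels out and the decision is unchanged.
uninformative = contrastive_decode(orig_logits, orig_logits)

# Targeted counterfactual: the cat region is removed, so only "cat" loses support.
targeted = contrastive_decode(orig_logits, [2.0, 0.3, 0.2])

print(vocab[argmax(uninformative)], vocab[argmax(targeted)])
```

In this toy case the uninformative perturbation leaves the ungrounded token on top, while the targeted counterfactual shifts the choice to the token whose support actually depends on the visual evidence.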

Method

Identify the objects in the video.
Assign a soft mask to each object.
Compute the Video-LLM's prediction loss.
Use the gradients of that loss to update the masks, producing a counterfactual video.
Perform contrastive decoding with the original and counterfactual inputs.
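
The mask-update step can be sketched with a toy stand-in for the Video-LLM. The summary does not specify the loss, the update direction, or the optimizer, so everything below is an assumption: the "model" is a logistic score over per-object evidence weights, the loss is the negative log-likelihood of the model's own answer, and the masks are updated by gradient ascent so that the objects the prediction depends on most are suppressed first. All names and numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prediction_loss(masks, weights):
    """Toy stand-in for the Video-LLM loss: NLL of the answer given the masked objects."""
    logit = sum(m * w for m, w in zip(masks, weights))
    return -math.log(sigmoid(logit))

def loss_gradient(masks, weights):
    """Analytic gradient of the toy loss with respect to each soft mask value."""
    logit = sum(m * w for m, w in zip(masks, weights))
    p = sigmoid(logit)
    return [-(1.0 - p) * w for w in weights]

def update_masks(masks, weights, lr=0.5, steps=10):
    """Assumed update rule: gradient ascent on the loss, with masks clipped to [0, 1]."""
    masks = list(masks)
    for _ in range(steps):
        grad = loss_gradient(masks, weights)
        masks = [min(1.0, max(0.0, m + lr * g)) for m, g in zip(masks, grad)]
    return masks

# Hypothetical evidence weights: object 0 drives the toy model's answer the most.
weights = [3.0, 0.5, 0.1]
initial_masks = [1.0, 1.0, 1.0]  # 1.0 means the object is fully visible

final_masks = update_masks(initial_masks, weights)

# The most strongly suppressed object is the one the prediction relied on most.
most_responsible = min(range(len(final_masks)), key=final_masks.__getitem__)

print(final_masks, most_responsible)
```

In a real pipeline the gradient would come from backpropagating through the Video-LLM rather than from a closed-form expression, and the updated masks would be applied to the object regions of the video frames to build the counterfactual input for contrastive decoding.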

In practice

Apply to Qwen and InternVL Video-LLMs.
Improve reliability on small, occluded, or co-occurring objects.
Improve video QA, reasoning, and captioning.

Topics

Video-LLMs
Hallucination Mitigation
Contrastive Decoding
Counterfactual Data
Inference Optimization
Object-level Analysis
Qwen, InternVL

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.