When Vision Speaks for Sound

2026-04-22 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Multimodal AI · Depth: Expert, extended

Summary

Current video-capable Multimodal Large Language Models (MLLMs) often exhibit an "audio-visual Clever Hans effect": they infer or hallucinate acoustic information from visual cues instead of verifying the actual audio stream. The behavior appears in open-source models such as MiniCPM-o-4.5 and Qwen3-Omni as well as closed-source models such as Google's Gemini and OpenAI's GPT-5.5, and it makes models look audio-grounded when they are in fact exploiting visual-acoustic correlations. To diagnose it, the researchers introduce Thud, an intervention-driven probing framework built on three counterfactual audio edits: Shift (temporal synchronization), Mute (sound existence), and Swap (audio-visual consistency). A two-stage alignment recipe that combines intervention-derived preference pairs with event-level general video preferences improved average performance across the three intervention dimensions by 28 percentage points, and also slightly improved results on general video and audio-visual QA benchmarks.

Key takeaway

If you develop or evaluate video-capable MLLMs, extend diagnostic testing beyond naturally correlated videos, where sound and image agree and a visual shortcut goes unnoticed. Use counterfactual audio-visual interventions such as Thud's Shift, Mute, and Swap to uncover visual-semantic shortcuts and to check for genuine audio-visual grounding. Integrating intervention-derived preference pairs into alignment training can improve audio verification and reduce hallucination, which matters most in real-world applications where audio accuracy is critical.

Key insights

MLLMs often hallucinate audio from visual cues, exhibiting a "Clever Hans effect" instead of true audio-visual grounding.

Principles

Models exploit visual-acoustic correlations without verifying audio.
Controlled interventions expose hidden model shortcuts.
Targeted training can mitigate audio-visual shortcut reliance.
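
To make the second principle concrete, here is a minimal sketch of how an intervention can expose a shortcut: ask the same question before and after an audio edit and count how often the answer stays the same. Everything in it is an assumption for illustration, not Thud's actual metric: the `unchanged_rate` function, the clip record layout, the two toy models, and the implied question ("Is a sound audible?"). The comparison is only meaningful when the edit should change the correct answer.

```python
from typing import Callable, Dict, List

Clip = Dict[str, object]  # hypothetical clip record with "video" and "audio" entries


def unchanged_rate(
    answer_fn: Callable[[object, object], str],
    clips: List[Clip],
    intervene: Callable[[object], object],
) -> float:
    """Fraction of clips whose answer stays the same after the audio edit."""
    if not clips:
        raise ValueError("need at least one clip")
    unchanged = 0
    for clip in clips:
        before = answer_fn(clip["video"], clip["audio"])
        after = answer_fn(clip["video"], intervene(clip["audio"]))
        unchanged += before == after
    return unchanged / len(clips)


# Toy stand-ins for a model answering "Is a sound audible?" (illustration only).
def vision_only(video, audio):
    # Ignores the audio entirely and guesses from what is on screen.
    return "yes" if video in ("dog barking", "door slamming") else "no"


def audio_checking(video, audio):
    # Answers from the audio samples alone.
    return "yes" if any(abs(s) > 0 for s in audio) else "no"


clips = [
    {"video": "dog barking", "audio": [0.3, -0.2]},
    {"video": "door slamming", "audio": [0.9, 0.1]},
]
mute = lambda audio: [0.0] * len(audio)  # silence the track, keep its length

print(unchanged_rate(vision_only, clips, mute))     # 1.0: answers never react to the edit
print(unchanged_rate(audio_checking, clips, mute))  # 0.0: answers follow the audio
```

On unedited clips both toy models answer "yes" everywhere, so they are indistinguishable; only the intervention separates them.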

Method

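Thud applies three counterfactual audio edits to a video: Shift, which probes temporal synchronization; Mute, which probes sound existence; and Swap, which probes audio-visual consistency. For post-training, a two-stage alignment recipe combines intervention-derived preference pairs with event-level general video preferences.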
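
As a rough illustration of what such edits could look like at the waveform level, the sketch below implements the three interventions on a mono track stored as a list of samples. The summary does not describe how Thud implements them, so the function names, the sample-offset parameter, and the choice to pad with silence are all assumptions.

```python
from typing import List

Waveform = List[float]  # mono audio samples; a hypothetical stand-in for a real audio tensor


def shift(audio: Waveform, offset: int) -> Waveform:
    """Delay (offset > 0) or advance (offset < 0) the audio by `offset` samples.

    The result keeps the original length; the gap is filled with silence.
    """
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    k = -offset
    return (audio + [0.0] * k)[k:]


def mute(audio: Waveform) -> Waveform:
    """Replace the track with silence of the same length."""
    return [0.0] * len(audio)


def swap(audio: Waveform, other: Waveform) -> Waveform:
    """Replace the track with audio from another clip, trimmed or zero-padded to fit."""
    n = len(audio)
    return (other + [0.0] * max(0, n - len(other)))[:n]


track = [1.0, 2.0, 3.0, 4.0]
print(shift(track, 1))             # [0.0, 1.0, 2.0, 3.0]
print(shift(track, -1))            # [2.0, 3.0, 4.0, 0.0]
print(mute(track))                 # [0.0, 0.0, 0.0, 0.0]
print(swap(track, [9.0, 8.0]))     # [9.0, 8.0, 0.0, 0.0]
```

In each case the video frames are left untouched, so any answer that still describes the original sound is coming from the visuals rather than the audio.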

In practice

Use Thud's Shift, Mute, and Swap interventions for MLLM diagnostics.
Implement preference-based alignment with counterfactual data.
Combine intervention data with general video data to prevent over-specialization.
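
The last two points can be sketched as a small data-preparation step: build preference pairs from intervened clips, then blend them with general video preference pairs. The summary does not say how the two stages are ordered or weighted, so this sketch simply pools the data at a chosen ratio, which may differ from the authors' recipe. `PreferencePair`, `pair_from_mute`, `mix_pairs`, and the `general_fraction` parameter are hypothetical names introduced here.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the model should prefer
    rejected: str  # response the model should move away from
    source: str    # "intervention" or "general"


def pair_from_mute(question: str, visually_implied_answer: str) -> PreferencePair:
    """After a Mute edit, prefer acknowledging silence over the visually implied sound."""
    return PreferencePair(
        prompt=question,
        chosen="The audio track is silent.",
        rejected=visually_implied_answer,
        source="intervention",
    )


def mix_pairs(
    intervention: List[PreferencePair],
    general: List[PreferencePair],
    general_fraction: float,
    seed: int = 0,
) -> List[PreferencePair]:
    """Keep every intervention pair and add general pairs until they make up
    roughly `general_fraction` of the result (capped by how many are available)."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    wanted = round(len(intervention) * general_fraction / (1.0 - general_fraction))
    sampled = rng.sample(general, min(wanted, len(general)))
    mixed = list(intervention) + sampled
    rng.shuffle(mixed)
    return mixed


intervention = [pair_from_mute(f"What do you hear in clip {i}?", "A loud bark.") for i in range(4)]
general = [PreferencePair(f"general prompt {i}", "good", "bad", "general") for i in range(10)]

mixed = mix_pairs(intervention, general, general_fraction=0.5)
print(len(mixed))                                   # 8
print(sum(p.source == "general" for p in mixed))    # 4
```

Keeping general pairs in the pool is one simple way to act on the over-specialization concern: the model sees counterfactual cases without training exclusively on them.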

Topics

Audio-Visual Clever Hans Effect
Multimodal Large Language Models
Thud Diagnostic Framework
Counterfactual Audio Interventions
Preference Alignment

Code references

rakanWen/wvs-code

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.