When Vision Speaks for Sound
Summary
Current video-capable Multimodal Large Language Models (MLLMs) often exhibit an "audio-visual Clever Hans effect": they infer or hallucinate acoustic information from visual cues rather than verifying the actual audio stream. The behavior appears in open-source models such as MiniCPM-o-4.5 and Qwen3-Omni and in closed-source models such as Google's Gemini and OpenAI's GPT-5.5, so a model can look audio-grounded while it is actually exploiting visual-acoustic correlations. To diagnose this, researchers introduced Thud, an intervention-driven probing framework built on three counterfactual audio edits: Shift (temporal synchronization), Mute (sound existence), and Swap (audio-visual consistency). A two-stage alignment recipe that combines intervention-derived preference pairs with event-level general video preferences improved average performance across the three intervention dimensions by 28 percentage points, and also slightly improved results on general video and audio-visual QA benchmarks.
Key takeaway
Research scientists developing or evaluating video-capable MLLMs should extend diagnostic testing beyond naturally correlated videos. Counterfactual audio-visual interventions such as Thud's Shift, Mute, and Swap can uncover visual-semantic shortcuts and show whether a model is genuinely grounded in the audio. Integrating intervention-derived preference pairs into alignment training improves audio verification and reduces hallucination, which matters most in real-world applications where audio accuracy is critical.
Key insights
MLLMs often hallucinate audio from visual cues, exhibiting a "Clever Hans effect" instead of true audio-visual grounding.
Principles
- Models exploit visual-acoustic correlations without verifying audio.
- Controlled interventions expose hidden model shortcuts.
- Targeted training can mitigate audio-visual shortcut reliance.
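One way to read "average performance across the three intervention dimensions" is as a macro-average of per-intervention accuracy. The sketch below assumes each evaluation result is a pair of an intervention name and a correctness flag; the function name `intervention_accuracy` and the result format are illustrative assumptions, not taken from the Thud paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def intervention_accuracy(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Per-intervention accuracy plus an unweighted average over interventions.

    `results` holds (intervention_name, is_correct) pairs. This format is a
    hypothetical choice for illustration only.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for intervention, is_correct in results:
        totals[intervention] += 1
        correct[intervention] += int(is_correct)

    scores = {name: correct[name] / totals[name] for name in totals}
    # Macro-average: each intervention counts equally, whatever its sample size.
    scores["average"] = sum(scores.values()) / len(scores) if scores else 0.0
    return scores


if __name__ == "__main__":
    demo = [("shift", True), ("shift", False), ("mute", True), ("swap", False)]
    print(intervention_accuracy(demo))
```

A model that scores well on unedited clips but poorly here is likely answering from the visuals rather than from the audio.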
Method
Thud applies three counterfactual audio edits: Shift (temporal synchronization), Mute (sound existence), and Swap (audio-visual consistency). For post-training, a two-stage alignment recipe combines intervention-derived preference pairs with event-level general video preferences.
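The three edits can be illustrated on a raw audio track. This is a minimal sketch, assuming the audio is a mono list of float samples at a known sample rate; the function names and the padding and trimming choices are assumptions made for illustration, and the summary does not describe how Thud actually implements the edits.

```python
from typing import List


def shift_audio(samples: List[float], sample_rate: int, offset_s: float) -> List[float]:
    """Shift: delay (offset_s > 0) or advance (offset_s < 0) the audio.

    The vacated region is filled with silence so the track length, and hence
    its alignment with the untouched video, stays the same.
    """
    n = min(int(round(abs(offset_s) * sample_rate)), len(samples))
    if n == 0:
        return list(samples)
    silence = [0.0] * n
    if offset_s > 0:
        return silence + samples[: len(samples) - n]
    return samples[n:] + silence


def mute_audio(samples: List[float]) -> List[float]:
    """Mute: replace the whole track with silence of the same length."""
    return [0.0] * len(samples)


def swap_audio(samples: List[float], donor: List[float]) -> List[float]:
    """Swap: substitute another clip's audio, trimmed or zero-padded to fit."""
    n = len(samples)
    return donor[:n] + [0.0] * max(0, n - len(donor))


if __name__ == "__main__":
    track = [1.0, 2.0, 3.0, 4.0]
    print(shift_audio(track, sample_rate=2, offset_s=1.0))
    print(mute_audio(track))
    print(swap_audio(track, donor=[9.0, 8.0]))
```

In each case the video frames are left unchanged, so any answer that still matches the original sound points to a visual shortcut.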
In practice
- Use Thud's Shift, Mute, and Swap interventions for MLLM diagnostics.
- Implement preference-based alignment with counterfactual data.
- Combine intervention data with general video data to prevent over-specialization.
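The data side of these steps can be sketched as follows. The example assumes a preference pair stores a prompt, a preferred answer that matches the edited audio, and a rejected answer that follows the visual shortcut. The `PreferencePair` fields, the helper names, and the `general_fraction` mixing parameter are all hypothetical; the summary does not specify the paper's data format, mixing ratio, or what each of the two stages contains.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PreferencePair:
    clip_id: str
    prompt: str
    chosen: str    # answer consistent with the (edited) audio
    rejected: str  # answer implied by the visuals alone
    source: str    # "intervention" or "general"


def make_intervention_pair(clip_id: str, intervention: str, prompt: str,
                           audio_grounded: str, visual_shortcut: str) -> PreferencePair:
    """Build one pair from a counterfactually edited clip (hypothetical format)."""
    return PreferencePair(f"{clip_id}:{intervention}", prompt,
                          audio_grounded, visual_shortcut, "intervention")


def mix_pairs(intervention_pairs: Sequence[PreferencePair],
              general_pairs: Sequence[PreferencePair],
              general_fraction: float, seed: int = 0) -> List[PreferencePair]:
    """Shuffle intervention pairs together with a sample of general pairs.

    `general_fraction` is the target share of general pairs in the mix,
    capped by how many general pairs are available.
    """
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    wanted = round(len(intervention_pairs) * general_fraction / (1.0 - general_fraction))
    rng = random.Random(seed)
    mixed = list(intervention_pairs) + rng.sample(list(general_pairs),
                                                  min(wanted, len(general_pairs)))
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    pair = make_intervention_pair("clip42", "mute", "Is a sound audible?",
                                  "No, the clip is silent.", "Yes, a door slams.")
    general = [PreferencePair(f"g{i}", "Describe the event.", "good", "bad", "general")
               for i in range(5)]
    print(len(mix_pairs([pair], general, general_fraction=0.5)))
```

Keeping a share of general video preferences in the mix is meant to reflect the advice above about avoiding over-specialization on the intervention tasks.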
Topics
- Audio-Visual Clever Hans Effect
- Multimodal Large Language Models
- Thud Diagnostic Framework
- Counterfactual Audio Interventions
- Preference Alignment
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.