DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models
Summary
DEAF (Diagnostic Evaluation of Acoustic Faithfulness) is a new benchmark that tests whether Audio Multimodal Large Language Models (Audio MLLMs) genuinely process acoustic signals or rely primarily on text-based semantic inference. It comprises over 2,700 conflict stimuli across three acoustic dimensions: emotional prosody (ESC), background sounds (BSC), and speaker identity (SIC). A controlled, multi-level evaluation framework progressively increases textual influence, moving from semantic conflicts within the spoken content, to misleading prompts, to both combined. Diagnostic metrics such as the Acoustic Robustness Score (ARS) and the Environment Discrimination Index (EDI) quantify how far a model relies on textual cues over acoustic signals. An evaluation of seven Audio MLLMs, including Gemini-2.5 Flash, GPT-4o-Audio, and Qwen3-Omni, consistently revealed text dominance: predictions were driven predominantly by textual inputs even though the models showed sensitivity to acoustic variations. For ESC and BSC, ARS often degraded to near zero under dual textual interference.
Key takeaway
If you develop or evaluate Audio MLLMs, recognize that current models often prioritize textual cues over acoustic information, even when they perceive the acoustic signal. Evaluation should move beyond standard benchmarks built on naturally aligned data and incorporate conflict-based diagnostics like DEAF to assess acoustic faithfulness directly. Consider designing future models with explicit paralinguistic pretraining objectives or grounding mechanisms to narrow this "perception-trust gap" and improve genuine acoustic understanding.
Key insights
Audio MLLMs exhibit pervasive text dominance: they prioritize textual cues over acoustic signals even when they detect acoustic variations.
Principles
- Text dominance appeared consistently across the seven evaluated Audio MLLMs, which points to a characteristic of current multimodal architectures rather than of any single model.
- Acoustic sensitivity does not equate to acoustic robustness in Audio MLLMs.
- Increasing textual interference severely degrades acoustic grounding in Audio MLLMs.
Method
DEAF uses 2,700+ stimuli across three acoustic dimensions (emotion, background, speaker identity) and three levels of textual interference (semantic conflict, misleading prompt, dual interference) to diagnose text dominance in Audio MLLMs.
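The three-by-three design described above can be sketched as a small data structure. This is only an illustration of the condition grid, not the benchmark's actual data format: the identifiers `DIMENSIONS`, `LEVELS`, `ConflictTrial`, and `condition_grid`, and the example labels, are assumptions made for this sketch.

```python
from dataclasses import dataclass
from itertools import product

# Dimension and level names follow the summary; the identifiers are illustrative.
DIMENSIONS = ("emotion", "background", "speaker_identity")
LEVELS = ("semantic_conflict", "misleading_prompt", "dual_interference")


@dataclass(frozen=True)
class ConflictTrial:
    """One hypothetical conflict stimulus (field names are assumptions)."""
    dimension: str       # which acoustic property is being probed
    level: str           # which kind of textual interference is applied
    acoustic_label: str  # what the audio signal actually conveys
    text_label: str      # what the conflicting text or prompt suggests


def condition_grid():
    """Return every (dimension, level) pair in the 3 x 3 design."""
    return list(product(DIMENSIONS, LEVELS))


grid = condition_grid()

# Hypothetical example: cheerful wording delivered in a sad voice.
example = ConflictTrial(
    dimension="emotion",
    level="semantic_conflict",
    acoustic_label="sad",
    text_label="happy",
)
```

Keeping the acoustic label and the text label as separate fields is what makes a trial diagnostic: a model's answer can be attributed to one source or the other only when the two disagree.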
In practice
- Evaluate Audio MLLMs using conflict-based benchmarks like DEAF.
- Prioritize paralinguistic pretraining for future audio encoders.
- Investigate inference-time grounding mechanisms to improve acoustic reasoning.
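As a rough sketch of what a conflict-based evaluation could tally, the snippet below counts how often a model's prediction matches the acoustic label versus the conflicting text label. It is not the ARS or EDI metric, whose formulas this summary does not give; the function `follow_rates` and the toy predictions are assumptions for illustration only.

```python
from collections import Counter


def follow_rates(trials):
    """Compute how often predictions follow the audio, the text, or neither.

    `trials` is an iterable of (prediction, acoustic_label, text_label)
    tuples, where the two labels disagree by construction.
    """
    counts = Counter()
    total = 0
    for prediction, acoustic_label, text_label in trials:
        total += 1
        if prediction == acoustic_label:
            counts["acoustic"] += 1
        elif prediction == text_label:
            counts["text"] += 1
        else:
            counts["other"] += 1
    if total == 0:
        raise ValueError("follow_rates needs at least one trial")
    return {key: counts[key] / total for key in ("acoustic", "text", "other")}


# Toy predictions, invented for this sketch (not results from the paper).
toy_trials = [
    ("happy", "sad", "happy"),    # follows the text
    ("happy", "sad", "happy"),    # follows the text
    ("sad", "sad", "happy"),      # follows the audio
    ("neutral", "sad", "happy"),  # follows neither
]
rates = follow_rates(toy_trials)
```

Computing these rates separately for each interference level would show whether the acoustic-following share shrinks as textual pressure increases, which is the pattern the summary reports.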
Topics
- Audio Multimodal LLMs
- Multimodal Benchmarking
- Acoustic-Semantic Conflict
- Text Dominance
- Acoustic Robustness Score
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.