DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

2026-03-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A new benchmark, DEAF (Diagnostic Evaluation of Acoustic Faithfulness), assesses whether Audio Multimodal Large Language Models (Audio MLLMs) genuinely process acoustic signals or rely primarily on text-based semantic inference. The benchmark comprises over 2,700 conflict stimuli across three acoustic dimensions: emotional prosody (ESC), background sounds (BSC), and speaker identity (SIC). Its controlled, multi-level evaluation framework progressively increases textual influence, moving from semantic conflicts within the content, to misleading prompts, to the two combined. Diagnostic metrics such as the Acoustic Robustness Score (ARS) and the Environment Discrimination Index (EDI) quantify how much a model relies on textual cues over acoustic signals. Evaluation of seven Audio MLLMs, including Gemini-2.5 Flash, GPT-4o-Audio, and Qwen3-Omni, consistently revealed text dominance: predictions were driven predominantly by textual inputs even though the models showed sensitivity to acoustic variations. For ESC and BSC, ARS often degraded to near zero under dual textual interference.

Key takeaway

If you develop or evaluate Audio MLLMs, recognize that current models often prioritize textual cues over acoustic information, even when they perceive the acoustic signal. Your evaluation should move beyond standard benchmarks built on naturally aligned data and incorporate conflict-based diagnostics like DEAF to assess acoustic faithfulness. Consider designing future models with explicit paralinguistic pretraining objectives or grounding mechanisms to narrow this "perception-trust gap" and improve genuine acoustic understanding.

Key insights

Audio MLLMs exhibit pervasive text dominance: they prioritize textual cues over acoustic signals even when they register the acoustic variations.

Principles

Text dominance appeared consistently across the seven Audio MLLMs evaluated, which suggests it is a systematic trait of current architectures.
Acoustic sensitivity does not equate to acoustic robustness in Audio MLLMs.
Increasing textual interference severely degrades acoustic grounding in Audio MLLMs.

Method

DEAF uses 2,700+ stimuli across three acoustic dimensions (emotion, background, speaker identity) and three levels of textual interference (semantic conflict, misleading prompt, dual interference) to diagnose text dominance in Audio MLLMs.
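The design described above is a grid of three acoustic dimensions crossed with three interference levels. The minimal sketch below enumerates that grid. The dimension abbreviations come from the summary; the level identifiers and the `condition_grid` helper are illustrative names chosen here, not identifiers from the DEAF release.

```python
from itertools import product

# Abbreviations as given in the summary; descriptions paraphrase it.
DIMENSIONS = {
    "ESC": "emotional prosody",
    "BSC": "background sounds",
    "SIC": "speaker identity",
}

# Hypothetical identifiers for the three interference levels,
# ordered from least to most textual influence.
LEVELS = ("semantic_conflict", "misleading_prompt", "dual_interference")


def condition_grid():
    """Return every (dimension, level) pairing as a list of dicts."""
    return [
        {"dimension": dim, "level": level}
        for dim, level in product(DIMENSIONS, LEVELS)
    ]


grid = condition_grid()
for cell in grid:
    print(f'{cell["dimension"]:>3} x {cell["level"]}')
```

Each of the nine cells would hold its own share of the stimuli; how the 2,700+ items are split across cells is not stated in the summary, so the sketch leaves that out.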

In practice

Evaluate Audio MLLMs using conflict-based benchmarks like DEAF.
Prioritize paralinguistic pretraining for future audio encoders.
Investigate inference-time grounding mechanisms to improve acoustic reasoning.
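As a starting point for conflict-based evaluation, the sketch below tallies how often a model's prediction follows the acoustic label versus the conflicting text label. This is a simple illustrative proxy, not the ARS or EDI definitions from the paper, which the summary does not give. The `ConflictTrial` type, the `follow_rates` function, and the example labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictTrial:
    acoustic_label: str  # what the audio actually conveys
    text_label: str      # what the transcript or prompt suggests
    prediction: str      # what the model answered


def follow_rates(trials):
    """Fraction of predictions matching the acoustic label, the text label, or neither."""
    if not trials:
        raise ValueError("need at least one trial")
    for t in trials:
        if t.acoustic_label == t.text_label:
            raise ValueError("a conflict trial needs differing labels")
    n = len(trials)
    acoustic = sum(t.prediction == t.acoustic_label for t in trials)
    text = sum(t.prediction == t.text_label for t in trials)
    return {
        "acoustic": acoustic / n,
        "text": text / n,
        "other": (n - acoustic - text) / n,
    }


# Made-up trials for illustration only; not data from the benchmark.
trials = [
    ConflictTrial("angry", "happy", "happy"),
    ConflictTrial("sad", "happy", "happy"),
    ConflictTrial("angry", "sad", "angry"),
    ConflictTrial("happy", "sad", "neutral"),
]
rates = follow_rates(trials)
print(rates)
```

A high "text" rate on trials where audio and text disagree is the pattern the article calls text dominance. Running the same tally separately for each interference level would show whether the rate grows as textual influence increases.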

Topics

Audio Multimodal LLMs
Multimodal Benchmarking
Acoustic-Semantic Conflict
Text Dominance
Acoustic Robustness Score

Code references

elevenlabs/elevenlabs-python

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.