The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs
Summary
A large-scale evaluation quantified intersectional accent and gender bias in Speech Large Language Models (SpeechLLMs). The researchers ran 2,880 controlled interactions across six English accents and two gender presentations, using MegaTTS3 voice cloning to hold linguistic content constant. Three SpeechLLMs were tested: LFMAudio2-1.5B, OmniVinci, and Qwen3-Omni-30B-A3B-Instruct. The study found consistent disparities: Eastern European-accented speech received lower helpfulness scores, particularly for female-presenting voices. Eastern European female voices had a mean helpfulness of 3.15, 0.47 points below Southern British female voices (3.62). The bias is implicit: it shows up as less specific or less actionable advice rather than as impoliteness. LLM judges (using gemini-3-flash-preview) captured the direction of the trend, but human evaluators were significantly more sensitive, uncovering sharper intersectional disparities and confirming that the quality differences are genuine.
Key takeaway
If you are an NLP engineer developing or deploying SpeechLLMs, recognize that implicit, intersectional biases can significantly degrade response utility for specific demographic groups, even when politeness is maintained. Your bias evaluations should move beyond proxy metrics and integrate human validation, especially Best–Worst Scaling, to detect subtle helpfulness gaps accurately. Doing so helps your models provide equitable and actionable advice across user identities.
Key insights
SpeechLLMs exhibit implicit intersectional bias: they give less helpful responses to specific accent-gender combinations, and human evaluation is needed to detect the full extent of the gap.
Principles
- SpeechLLMs' end-to-end processing retains identity cues.
- Intersectional biases compound disparities in AI responses.
- Implicit bias reduces helpfulness, not politeness.
Method
The study used voice cloning to control linguistic content while varying accent and perceived gender. It combined pointwise LLM-judge ratings, pairwise comparisons, and Best–Worst Scaling (BWS) with human validation to detect subtle response quality shifts.
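To make the Best–Worst Scaling step concrete, here is a minimal sketch of the common count-based BWS score: (times chosen best minus times chosen worst) divided by times shown. This summary does not say which BWS scoring variant the study used, so the formula choice, the function name `bws_scores`, and the toy trials are all assumptions for illustration only.

```python
from collections import defaultdict


def bws_scores(trials):
    """Count-based Best-Worst Scaling score per item.

    Each trial is (items_shown, best_item, worst_item).
    Score = (times best - times worst) / times shown, in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, best_item, worst_item in trials:
        for item in items:
            shown[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical annotations (not data from the study): A-D stand for responses
# elicited by four different voice conditions on the same prompt.
trials = [
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "D"), "B", "D"),
    (("A", "B", "C", "D"), "A", "C"),
]
scores = bws_scores(trials)
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{item}: {score:+.2f}")
```

Because annotators only pick the best and worst item in each set, this kind of score forces a ranking even when absolute ratings would cluster together, which is one plausible reason BWS surfaces gaps that pointwise ratings blur.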
In practice
- Evaluate SpeechLLMs for implicit intersectional bias.
- Use human evaluators to detect subtle helpfulness gaps.
- Prioritize in-domain evaluations over proxy metrics.
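The practices above can start from something as simple as grouping helpfulness ratings by accent and gender and comparing cell means. The sketch below is one hypothetical way to do that; the function names, the 1-5 rating scale, and the toy ratings are assumptions, not the study's pipeline or data.

```python
from collections import defaultdict
from statistics import mean


def group_means(ratings):
    """Mean helpfulness per (accent, gender) cell.

    ratings: iterable of (accent, gender, score) tuples.
    """
    by_group = defaultdict(list)
    for accent, gender, score in ratings:
        by_group[(accent, gender)].append(score)
    return {group: mean(scores) for group, scores in by_group.items()}


def largest_gap(means):
    """Return (best_group, worst_group, gap) across all cells."""
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    return best, worst, means[best] - means[worst]


# Toy ratings on an assumed 1-5 helpfulness scale (made up for illustration).
ratings = [
    ("Southern British", "female", 4),
    ("Southern British", "female", 4),
    ("Eastern European", "female", 3),
    ("Eastern European", "female", 3),
    ("Eastern European", "male", 4),
    ("Eastern European", "male", 3),
]
means = group_means(ratings)
best, worst, gap = largest_gap(means)
print(f"largest gap: {best} vs {worst} = {gap}")
```

Keying on the (accent, gender) pair rather than on each attribute separately is what makes the comparison intersectional: a gap can appear in one cell even when the per-accent and per-gender averages look similar.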
Topics
- SpeechLLMs
- Intersectional Bias
- Accent Bias
- Gender Bias
- Human Evaluation
- Best–Worst Scaling
Best for: Research Scientist, AI Product Manager, AI Scientist, NLP Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.