The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

A large-scale evaluation quantified intersectional accent and gender bias in Speech Large Language Models (SpeechLLMs). Researchers ran 2,880 controlled interactions across six English accents and two gender presentations, holding linguistic content constant through voice cloning with MegaTTS3. Three SpeechLLMs were tested: LFMAudio2-1.5B, OmniVinci, and Qwen3-Omni-30B-A3B-Instruct. The study found consistent disparities: Eastern European-accented speech received lower helpfulness scores, particularly for female-presenting voices. Eastern European female voices scored a mean helpfulness of 3.15, which is 0.47 points below Southern British female voices (3.62). The bias is implicit: it shows up as less specific or less actionable advice rather than as impoliteness. LLM judges (using gemini-3-flash-preview) captured the directional trend, but human evaluators were significantly more sensitive, uncovering sharper intersectional disparities and confirming that the quality differences are genuine.

Key takeaway

NLP engineers developing or deploying SpeechLLMs should recognize that implicit, intersectional biases can significantly degrade response utility for specific demographic groups, even when politeness is maintained. Bias evaluations should move beyond proxy metrics and integrate human validation, especially Best–Worst Scaling, to accurately detect subtle helpfulness gaps. Doing so helps models provide equitable and actionable advice across user identities.

Key insights

SpeechLLMs exhibit implicit intersectional bias: they give less helpful responses to specific accent-gender combinations, and detecting the full extent of the gap requires human evaluation.

Principles

Method

The study used voice cloning to control linguistic content while varying accent and perceived gender. It combined pointwise LLM-judge ratings, pairwise comparisons, and Best–Worst Scaling (BWS) with human validation to detect subtle response quality shifts.
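
As a rough illustration of how Best–Worst Scaling judgments can be turned into per-condition scores, the sketch below uses the common counting approach: for each condition, (times chosen best minus times chosen worst) divided by times shown. The summary does not say how the study aggregated its BWS judgments, so the scoring rule, the function name bws_scores, the condition labels, and the toy trials are all assumptions made for illustration, not the authors' method or data.

```python
from collections import Counter


def bws_scores(trials):
    """Counting-based Best-Worst Scaling scores (hypothetical helper).

    trials: iterable of (items, best, worst), where items is the tuple of
    conditions shown together and best/worst are the annotator's picks.
    Returns {item: (best_count - worst_count) / times_shown}, in [-1, 1].
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, b, w in trials:
        if b not in items or w not in items or b == w:
            raise ValueError(f"invalid trial: {items!r}, best={b!r}, worst={w!r}")
        shown.update(items)   # every displayed condition counts as one appearance
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Toy, made-up judgments over four placeholder conditions (not study data).
conditions = ("cond_A", "cond_B", "cond_C", "cond_D")
toy_trials = [
    (conditions, "cond_A", "cond_D"),
    (conditions, "cond_A", "cond_C"),
    (conditions, "cond_B", "cond_D"),
    (conditions, "cond_A", "cond_D"),
]

toy_scores = bws_scores(toy_trials)
for name, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:+.2f}")
```

In an audit of this kind, each condition would correspond to one accent-gender voice, and each trial to a set of model responses to the same prompt. A score near +1 means the responses to that voice were almost always judged most helpful, and a score near -1 means almost always least helpful.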

Best for: Research Scientist, AI Product Manager, AI Scientist, NLP Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.