PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
Summary
PolySpeech-100 is a large-scale benchmark for evaluating "native-level" speech comprehension across 110 linguistic variants. It targets three limitations of existing benchmarks: a bias towards high-resource languages, a narrow focus on ASR, and neglect of regional dialects. The benchmark is built with a hybrid construction pipeline that combines gold-standard human recordings with instruction-driven synthetic speech, covering 19 distinct Chinese dialects and over 80 low-resource languages. An evaluation of 22 state-of-the-art models, including Gemini-3, GPT-Audio, and Qwen2.5-Omni, produced three main findings. First, open-source End-to-End (E2E) Speech-LLMs outperformed Cascade (ASR+LLM) systems on heavy dialects, which suggests that direct audio processing preserves paralinguistic cues. Second, commercial models remained robust on low-resource languages, while open-source models degraded catastrophically. Third, and counter-intuitively, Chain-of-Thought prompting often degraded speech understanding in zero-shot settings, which suggests a modality alignment gap.
Key takeaway
For Machine Learning Engineers developing Speech-LLMs, this benchmark exposes performance gaps between architectures, between commercial and open-source models, and between prompting styles. Consider prioritizing End-to-End architectures for heavy-dialect audio: in this evaluation, open-source E2E models outperformed Cascade (ASR+LLM) systems there, which suggests they preserve paralinguistic cues. Be aware that open-source models may fail catastrophically on low-resource languages, so plan for targeted fine-tuning or commercial alternatives. Finally, re-evaluate Chain-of-Thought prompting for speech tasks, since it often degraded understanding in zero-shot settings.
Key insights
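PolySpeech-100 shows that open-source E2E Speech-LLMs outperform Cascade (ASR+LLM) systems on heavy dialects, that open-source models degrade sharply on low-resource languages while commercial models stay robust, and that zero-shot CoT prompting often degrades speech understanding.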
Principles
- Direct audio processing preserves paralinguistic cues.
- Commercial Speech-LLMs stay robust on low-resource languages where open-source models degrade.
- Chain-of-Thought prompting can hinder speech understanding in zero-shot settings.
Method
PolySpeech-100 uses a hybrid pipeline augmenting human recordings with instruction-driven synthetic speech to cover 110 linguistic variants, including 19 Chinese dialects and 80+ low-resource languages.
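One way to picture a hybrid dataset of this kind is a manifest in which each clip is tagged with its linguistic variant and its origin. The following is a minimal sketch under that assumption, not the benchmark's actual pipeline or data format: the field names (`variant`, `source`), the source labels, and the placeholder rows are all hypothetical.

```python
from collections import Counter

def coverage(manifest):
    """Count clips per (variant, source) and list variants with no human recordings.

    `manifest` is a list of dicts with hypothetical keys:
      - "variant": label of the language or dialect
      - "source":  "human" or "synthetic"
    """
    counts = Counter((m["variant"], m["source"]) for m in manifest)
    variants = {m["variant"] for m in manifest}
    # A Counter returns 0 for missing keys, so absent human clips show up here.
    synthetic_only = sorted(v for v in variants if counts[(v, "human")] == 0)
    return counts, synthetic_only

# Placeholder rows for illustration only; not real benchmark data.
manifest = [
    {"variant": "variant_a", "source": "human"},
    {"variant": "variant_a", "source": "synthetic"},
    {"variant": "variant_b", "source": "synthetic"},
]

counts, synthetic_only = coverage(manifest)
print(len({m["variant"] for m in manifest}), "variants covered")
print("synthetic-only variants:", synthetic_only)
```

A check like this makes it easy to see which variants rest entirely on synthetic speech, which matters when interpreting per-variant scores.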
In practice
- Prioritize E2E models for dialect-rich audio.
- Re-evaluate CoT prompting for speech tasks.
- Test open-source models rigorously on low-resource data.
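The checks above can be sketched as a per-group accuracy breakdown that also compares direct and CoT prompting. This is a minimal sketch, not the benchmark's evaluation harness: the record fields (`group`, `prompt`, `correct`), the group and prompt labels, and the toy rows are hypothetical placeholders and carry no real results.

```python
from collections import defaultdict

def accuracy_by(records, *keys):
    """Return {key_tuple: accuracy} for records grouped by the given fields."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        k = tuple(r[key] for key in keys)
        totals[k] += 1
        hits[k] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

def cot_delta(records):
    """Per group: accuracy with CoT prompting minus accuracy with direct prompting.

    Groups missing either prompt style are skipped.
    """
    acc = accuracy_by(records, "group", "prompt")
    groups = {g for g, _ in acc}
    return {
        g: acc[(g, "cot")] - acc[(g, "direct")]
        for g in groups
        if (g, "cot") in acc and (g, "direct") in acc
    }

# Toy rows for illustration only; replace with your own model outputs.
records = [
    {"group": "dialect", "prompt": "direct", "correct": True},
    {"group": "dialect", "prompt": "direct", "correct": True},
    {"group": "dialect", "prompt": "cot", "correct": True},
    {"group": "dialect", "prompt": "cot", "correct": False},
    {"group": "low_resource", "prompt": "direct", "correct": True},
    {"group": "low_resource", "prompt": "direct", "correct": False},
    {"group": "low_resource", "prompt": "cot", "correct": False},
    {"group": "low_resource", "prompt": "cot", "correct": False},
]

print(accuracy_by(records, "group"))
print(cot_delta(records))
```

Reporting accuracy per group rather than as a single average keeps a collapse on low-resource data from being hidden by strong high-resource scores, and a negative delta flags groups where CoT prompting hurts.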
Topics
- PolySpeech-100
- Speech-LLMs
- Multilingual Speech Understanding
- Low-Resource Languages
- Dialect Recognition
- Chain-of-Thought Prompting
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.