PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

PolySpeech-100 is a large-scale benchmark for evaluating "native-level" speech comprehension across 110 linguistic variants. It targets three limitations of existing benchmarks: a bias towards high-resource languages, a narrow focus on ASR, and neglect of regional dialects. The benchmark is built with a hybrid construction pipeline that combines gold-standard human recordings with instruction-driven synthetic speech, covering 19 distinct Chinese dialects and over 80 low-resource languages. An evaluation of 22 state-of-the-art models, including Gemini-3, GPT-Audio, and Qwen2.5-Omni, yielded three main findings. First, open-source End-to-End (E2E) Speech-LLMs outperformed Cascade (ASR+LLM) systems on heavy dialects, which suggests that direct audio processing preserves paralinguistic cues that are lost in an intermediate transcript. Second, commercial models remained robust on low-resource languages while open-source models degraded catastrophically. Third, and counter-intuitively, Chain-of-Thought prompting often degraded speech understanding in zero-shot settings, suggesting a modality alignment gap.

Key takeaway

For Machine Learning Engineers developing Speech-LLMs, this benchmark highlights critical performance disparities. Favor End-to-End architectures when robust dialect understanding matters, since the results suggest they preserve paralinguistic cues that Cascade systems lose. Expect open-source models to fail badly on low-resource languages unless you apply targeted fine-tuning or fall back to commercial alternatives. Finally, re-evaluate Chain-of-Thought prompting for speech tasks, because it can degrade understanding in zero-shot settings.
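One way to act on this advice is to break your own evaluation results down by language group and prompting condition rather than reporting a single aggregate score. The following is a minimal sketch of that breakdown, not the benchmark's evaluation code. The record fields (group, condition, correct), the function names, and the toy records are all hypothetical and exist only to illustrate the bookkeeping; the toy values are placeholders, not PolySpeech-100 results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(group, condition): accuracy} from an iterable of result records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["group"], r["condition"])
        totals[key] += 1
        correct[key] += int(r["correct"])
    return {key: correct[key] / totals[key] for key in totals}


def condition_gap(records, baseline, variant):
    """Return {group: accuracy(variant) - accuracy(baseline)}.

    Only groups that were evaluated under both conditions are included.
    """
    acc = accuracy_by_group(records)
    groups = {group for group, _ in acc}
    return {
        group: acc[(group, variant)] - acc[(group, baseline)]
        for group in sorted(groups)
        if (group, variant) in acc and (group, baseline) in acc
    }


# Placeholder records for illustration only; NOT benchmark results.
toy_records = [
    {"group": "dialect", "condition": "direct", "correct": True},
    {"group": "dialect", "condition": "direct", "correct": True},
    {"group": "dialect", "condition": "cot", "correct": True},
    {"group": "dialect", "condition": "cot", "correct": False},
    {"group": "low_resource", "condition": "direct", "correct": False},
    {"group": "low_resource", "condition": "direct", "correct": True},
    {"group": "low_resource", "condition": "cot", "correct": False},
    {"group": "low_resource", "condition": "cot", "correct": False},
]

# A negative gap means the variant condition scored below the baseline.
gaps = condition_gap(toy_records, baseline="direct", variant="cot")
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

The same two functions can compare E2E against Cascade systems by storing the system type in the condition field, which keeps dialect and low-resource regressions visible instead of averaging them away.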

Key insights

PolySpeech-100 shows that open-source E2E Speech-LLMs outperform Cascade systems on heavy dialects, that open-source models degrade sharply on low-resource languages, and that CoT prompting can degrade zero-shot performance.

Method

PolySpeech-100 uses a hybrid pipeline augmenting human recordings with instruction-driven synthetic speech to cover 110 linguistic variants, including 19 Chinese dialects and 80+ low-resource languages.

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.