KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs
Summary
Haechan Kim and colleagues introduce KVoiceBench, KOpenAudioBench, and KMMAU, three publicly released Korean speech benchmarks designed to address the English-centric evaluation limitations of Speech Language Models (SpeechLMs). These benchmarks, comprising 12,345 samples, were constructed using two human-agent frameworks: one for transferring source-language SpokenQA benchmarks to target-language SpokenQA, and another for converting target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. The authors evaluated eight recent SpeechLMs, finding significant English-Korean performance disparities across models and task families. Their analysis also revealed that SpokenQA and audio understanding tasks yielded divergent rankings, highlighting complementary weaknesses not apparent in English-only evaluations.
Key takeaway
If you develop or deploy multilingual SpeechLMs, integrate language-specific benchmarks such as KVoiceBench, KOpenAudioBench, and KMMAU into your evaluation pipeline. Relying solely on English benchmarks can mask performance gaps and task-specific weaknesses in other languages, as the Korean results here show. Your evaluation strategy should account for language-specific instructions, speaker attributes, and paralinguistic properties so that it reflects how the model actually performs in the target language.
Key insights
Multilingual SpeechLM evaluation requires language-specific benchmarks to reveal true performance gaps and task-specific weaknesses.
Principles
- Direct benchmark transfer corrupts language-specific instructions.
- Source-language audio transfer fails to preserve target-language speaker attributes.
- English-only evaluation obscures multilingual SpeechLM weaknesses.
Method
Two human-agent frameworks: one transfers SpokenQA benchmarks, the other converts ASR corpora with transcriptions and speaker metadata into audio understanding benchmarks.
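To make the second framework more concrete, here is a minimal sketch of one step it could involve: turning an ASR record with speaker metadata into a multiple-choice audio understanding item. This is not the authors' pipeline, which this summary does not describe in detail. The `AsrRecord` and `McqItem` schemas, the `make_speaker_question` helper, the file name, and the Korean question text are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class AsrRecord:
    """One utterance from an ASR corpus (hypothetical schema)."""
    audio_path: str
    transcription: str
    speaker_gender: str


@dataclass
class McqItem:
    """A multiple-choice audio understanding item (hypothetical schema)."""
    audio_path: str
    question: str
    choices: list[str]
    answer: str


def make_speaker_question(record, attribute, label_set, question, rng):
    """Build a question whose gold answer comes from speaker metadata."""
    answer = getattr(record, attribute)
    if answer not in label_set:
        raise ValueError(f"{answer!r} is not one of the allowed labels")
    # Shuffle a copy so the gold answer does not sit at a fixed position.
    choices = list(label_set)
    rng.shuffle(choices)
    return McqItem(record.audio_path, question, choices, answer)


# Illustrative record; the path and transcription are made up.
record = AsrRecord("utt_0001.wav", "안녕하세요", "female")
item = make_speaker_question(
    record,
    attribute="speaker_gender",
    label_set=["female", "male"],
    question="화자의 성별은 무엇입니까?",  # "What is the speaker's gender?"
    rng=random.Random(0),
)
```

Because the answer is read from the corpus metadata rather than from the transcription, the model has to listen to the audio to answer, which is the kind of speaker-attribute question that source-language audio cannot supply. In a human-agent setup, items like this would presumably still be reviewed before release.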
In practice
- Use KVoiceBench for Korean SpokenQA.
- Apply KMMAU for Korean audio understanding.
- Evaluate SpeechLMs on diverse language-specific tasks.
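One way to act on these points is to report, per model, the English-Korean score gap and to check whether SpokenQA and audio understanding rank models the same way. The sketch below assumes you already have per-model accuracy scores in plain dictionaries. The function names, model names, and numbers are hypothetical and are not results from the paper. Rank agreement is measured with Spearman's rank correlation, computed by the standard no-ties formula.

```python
def language_gap(scores_en, scores_ko):
    """English score minus Korean score for each model."""
    return {model: scores_en[model] - scores_ko[model] for model in scores_en}


def ranks(scores):
    """Rank models from best (1) to worst; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score tables (no ties)."""
    n = len(scores_a)
    if n < 2 or set(scores_a) != set(scores_b):
        raise ValueError("need the same models in both tables, at least two")
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((ranks_a[m] - ranks_b[m]) ** 2 for m in scores_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical accuracies, for illustration only.
spoken_qa_en = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70}
spoken_qa_ko = {"model_a": 0.70, "model_b": 0.60, "model_c": 0.50}
audio_understanding_ko = {"model_a": 0.40, "model_b": 0.55, "model_c": 0.65}

gaps = language_gap(spoken_qa_en, spoken_qa_ko)
agreement = spearman(spoken_qa_ko, audio_understanding_ko)
```

A correlation near 1 means the two task families order the models alike; a value near 0 or below means they disagree, which is the situation where a single benchmark would hide a model's weak side.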
Topics
- Speech Language Models
- Multilingual Benchmarking
- Korean Speech Processing
- Spoken Question Answering
- Audio Understanding
- Benchmark Construction Frameworks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.