KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

2026-05-27 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

Haechan Kim and colleagues introduce KVoiceBench, KOpenAudioBench, and KMMAU, three publicly released Korean speech benchmarks designed to address the English-centric evaluation limitations of Speech Language Models (SpeechLMs). These benchmarks, comprising 12,345 samples, were constructed using two human-agent frameworks: one for transferring source-language SpokenQA benchmarks to target-language SpokenQA, and another for converting target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. The authors evaluated eight recent SpeechLMs, finding significant English-Korean performance disparities across models and task families. Their analysis also revealed that SpokenQA and audio understanding tasks yielded divergent rankings, highlighting complementary weaknesses not apparent in English-only evaluations.

Key takeaway

Machine Learning Engineers developing or deploying multilingual SpeechLMs should integrate language-specific benchmarks such as KVoiceBench, KOpenAudioBench, and KMMAU into their evaluation pipelines. Relying solely on English benchmarks masks performance gaps and task-specific weaknesses in other languages, Korean among them. An evaluation strategy should account for language-specific instructions, speaker attributes, and paralinguistic properties in order to assess model capabilities accurately and support robust real-world performance.

Key insights

Multilingual SpeechLM evaluation requires language-specific benchmarks to reveal true performance gaps and task-specific weaknesses.

Principles

Direct benchmark transfer corrupts language-specific instructions.
Source-language audio transfer fails to preserve target-language speaker attributes.
English-only evaluation obscures multilingual SpeechLM weaknesses.

Method

Two human-agent frameworks: one transfers SpokenQA benchmarks, the other converts ASR corpora with transcriptions and speaker metadata into audio understanding benchmarks.
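
The summary does not describe the internals of either framework, so the following is only a minimal sketch of the general idea behind the second one: turning an ASR record that carries a transcription and speaker metadata into a draft audio understanding item. The record layout, the field names (`audio_path`, `speaker`, `age_group`), and the function `speaker_attribute_item` are all hypothetical, and the question text is in English only for readability. In a human-agent setup, a draft like this would still go to a human reviewer before entering a benchmark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AsrRecord:
    """One utterance from an ASR corpus (hypothetical layout)."""
    audio_path: str
    transcription: str
    speaker: dict  # e.g. {"gender": "female", "age_group": "30s"}


@dataclass
class BenchmarkItem:
    """A draft multiple-choice audio understanding item."""
    audio_path: str
    question: str
    choices: list
    answer: str


def speaker_attribute_item(record: AsrRecord, attribute: str,
                           label_set: set) -> Optional[BenchmarkItem]:
    """Draft a question about one speaker attribute.

    Returns None when the metadata is missing or falls outside the
    allowed labels, so unusable records are skipped instead of guessed.
    """
    answer = record.speaker.get(attribute)
    if answer is None or answer not in label_set:
        return None
    return BenchmarkItem(
        audio_path=record.audio_path,
        question=f"What is the speaker's {attribute.replace('_', ' ')}?",
        choices=sorted(label_set),  # sorted for a deterministic option order
        answer=answer,
    )


# Usage with an illustrative record (not taken from any real corpus).
record = AsrRecord("utt_0001.wav", "안녕하세요", {"gender": "female"})
item = speaker_attribute_item(record, "gender", {"female", "male"})
```

Deriving the answer from corpus metadata, not from a model's guess, keeps the label tied to the recorded speaker, which is the property the summary says source-language audio transfer fails to preserve.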

In practice

Use KVoiceBench for Korean SpokenQA.
Apply KMMAU for Korean audio understanding.
Evaluate SpeechLMs on diverse language-specific tasks.
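
One way to act on these points is to report, per model, the English-Korean score gap and to check whether two task families rank models the same way. The sketch below assumes you already have one accuracy-like score per model and task family. The model names and every number are placeholders invented for illustration, not results from the paper, and the helper names (`language_gap`, `spearman`) are hypothetical.

```python
from statistics import mean


def language_gap(english: dict, korean: dict) -> dict:
    """Per-model score difference (English minus Korean)."""
    return {m: english[m] - korean[m] for m in english if m in korean}


def ranks(scores: dict) -> dict:
    """Rank models by score (1 = best); tied models share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for k in range(i, j + 1):
            result[ordered[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(a: dict, b: dict) -> float:
    """Spearman rank correlation over the models present in both dicts."""
    models = sorted(set(a) & set(b))
    ra = ranks({m: a[m] for m in models})
    rb = ranks({m: b[m] for m in models})
    xa = [ra[m] for m in models]
    xb = [rb[m] for m in models]
    ma, mb = mean(xa), mean(xb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(xa, xb))
    var_a = sum((x - ma) ** 2 for x in xa)
    var_b = sum((y - mb) ** 2 for y in xb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rank correlation is undefined when all ranks are tied")
    return cov / (var_a * var_b) ** 0.5


# Placeholder scores only; replace with your own evaluation outputs.
spoken_qa_en = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70}
spoken_qa_ko = {"model_a": 0.70, "model_b": 0.60, "model_c": 0.50}
audio_understanding_ko = {"model_a": 0.40, "model_b": 0.55, "model_c": 0.65}

gaps = language_gap(spoken_qa_en, spoken_qa_ko)
rho = spearman(spoken_qa_ko, audio_understanding_ko)
```

A correlation near 1 means the two task families order models alike. A low or negative value is the kind of divergence the summary describes between SpokenQA and audio understanding rankings, and a reason to report both families and not just one aggregate score.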

Topics

Speech Language Models
Multilingual Benchmarking
Korean Speech Processing
Spoken Question Answering
Audio Understanding
Benchmark Construction Frameworks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.