Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades
Summary
An analysis of error propagation in Korean Spoken Question Answering (SQA) with ASR-LLM cascades finds that downstream performance degradation is driven primarily by information loss at the Automatic Speech Recognition (ASR) stage rather than by Large Language Model (LLM) capability. The study used synthesized Korean speech at noise levels corresponding to character error rates (CER) of 0.03–0.50, Whisper-large-v3 as the ASR system, and four instruction-tuned LLMs (Qwen2.5-7B/32B-Instruct, SOLAR-10.7B-Instruct, EXAONE-3.5-32B-Instruct). Relative degradation was consistent across the LLMs. A critical finding is that single-character Korean ASR errors act as a distinct semantic-failure channel: in 12.5% of 1,206 analyzed cases, the gold answer was entirely absent. A large audio language model (Qwen2.5-Omni-7B-Instruct) outperformed the ASR-LLM pipeline, with average gains of +0.058 F1 / +0.055 EM. By contrast, an ASR-aware disclaimer prompt did not reliably improve noisy QA performance.
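For readers unfamiliar with the F1 and EM scores quoted above, the following is a minimal sketch of how such answer-level metrics are commonly computed. It assumes character-level F1 with whitespace ignored, a convention often used for Korean QA; the study's exact scoring rules are not given in this summary, and the function names `exact_match` and `char_f1` are illustrative.

```python
from collections import Counter


def exact_match(pred: str, gold: str) -> float:
    """Return 1.0 if the stripped prediction equals the stripped gold answer, else 0.0."""
    return float(pred.strip() == gold.strip())


def char_f1(pred: str, gold: str) -> float:
    """Character-level F1 over non-whitespace characters (an assumed convention)."""
    pred_chars = [c for c in pred if not c.isspace()]
    gold_chars = [c for c in gold if not c.isspace()]
    if not pred_chars or not gold_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == gold_chars)
    # Multiset intersection counts each shared character at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, a one-syllable slip in a three-syllable answer still earns partial F1 credit but zero EM, which is one reason the two metrics are reported together.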
Key takeaway
NLP engineers building Korean spoken QA systems should prioritize ASR accuracy and robustness, because ASR-stage information loss is the main performance bottleneck. LLM prompting changes such as ASR-aware disclaimers did not reliably improve noisy QA performance in the study, so they are unlikely to yield significant gains. Direct audio language models, which bypass transcription errors, are worth exploring instead, especially given the high semantic impact of single-character ASR mistakes in Korean.
Key insights
ASR-stage information loss, especially single-character errors in Korean, is the primary bottleneck for ASR-LLM cascade performance.
Principles
- The relative performance drop caused by ASR errors is consistent across LLMs of different capability.
- Minimal ASR transcription differences can cause complete semantic failure.
- Direct audio input can mitigate ASR-induced information loss.
Method
The study synthesized Korean speech, added noise at seven SNR levels (corresponding to character error rates of 0.03–0.50), transcribed the audio with ASR, and fed the transcripts to LLMs for SQA evaluation. It then compared the ASR-LLM cascades against a direct audio language model and against disclaimer prompts.
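The character error rate used to describe the noise levels can be sketched as follows. This is a minimal illustration, assuming CER is the Levenshtein distance between reference and hypothesis divided by the reference length, computed over NFC-normalized characters with whitespace removed. The study's exact normalization is not stated in this summary, and the function names are illustrative.

```python
import unicodedata


def _normalize(text: str) -> str:
    """NFC-normalize so each Hangul syllable is one code point, then drop whitespace."""
    return "".join(unicodedata.normalize("NFC", text).split())


def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (substitutions, insertions, deletions) via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = _normalize(reference)
    hyp = _normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return levenshtein(ref, hyp) / len(ref)
```

For example, `cer("이순신 장군", "이순실 장군")` gives 0.2 under these assumptions: one substituted syllable out of five.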
In practice
- Prioritize ASR robustness over LLM prompting for SQA.
- Consider direct audio language models for noisy speech inputs.
- Be aware of single-character ASR error sensitivity in Korean.
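The single-character sensitivity in the list above can be checked with a simple substring test on the ASR transcript. The sketch below is hypothetical: the helper name `gold_answer_survives` and the example sentences are invented for illustration and are not taken from the study, whose own analysis procedure may differ.

```python
import unicodedata


def _normalize(text: str) -> str:
    """NFC-normalize and drop whitespace so spacing differences do not hide a match."""
    return "".join(unicodedata.normalize("NFC", text).split())


def gold_answer_survives(transcript: str, gold_answer: str) -> bool:
    """Return True if the gold answer still appears verbatim in the transcript."""
    return _normalize(gold_answer) in _normalize(transcript)


# Invented example: a single substituted syllable removes the answer entirely.
clean = "거북선을 만든 사람은 이순신이다"
noisy = "거북선을 만든 사람은 이순실이다"  # 신 -> 실, one character differs
gold = "이순신"

clean_ok = gold_answer_survives(clean, gold)
noisy_ok = gold_answer_survives(noisy, gold)
```

Here `clean_ok` is `True` and `noisy_ok` is `False`. Running a check like this over a noisy evaluation set gives a rough count of cases where no downstream LLM could extract the answer from the transcript alone.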
Topics
- Korean Spoken QA
- ASR-LLM Cascades
- Error Propagation
- Large Audio Language Models
- Character Error Rate
- Qwen2.5-Omni-7B-Instruct
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.