Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

2026-05-17 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, long

Summary

An analysis of error propagation in Korean Spoken Question Answering (SQA) with ASR-LLM cascades finds that downstream performance degradation is driven primarily by information loss at the Automatic Speech Recognition (ASR) stage rather than by limits in Large Language Model (LLM) capability. The study used synthesized Korean speech with added noise that produced character error rates (CER) from 0.03 to 0.50, Whisper-large-v3 for ASR, and four instruction-tuned LLMs (Qwen2.5-7B-Instruct, Qwen2.5-32B-Instruct, SOLAR-10.7B-Instruct, EXAONE-3.5-32B-Instruct). Relative degradation was consistent across all four LLMs. A critical finding is that single-character Korean ASR errors act as a distinct semantic-failure channel: in 12.5% of 1,206 analyzed cases, the gold answer was entirely absent. A large audio language model (Qwen2.5-Omni-7B-Instruct) significantly outperformed the ASR-LLM pipeline, with average gains of +0.058 F1 and +0.055 exact match (EM). An ASR-aware disclaimer prompt, by contrast, did not reliably improve noisy QA performance.
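To make the reported F1 and EM gains concrete, here is a minimal sketch of how the two metrics are commonly computed for extractive QA. It is an illustration only: the summary does not state how the paper tokenizes or normalizes answers, so the character-level overlap and the whitespace-only normalization below are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Trim and collapse whitespace (assumed normalization, not the paper's)."""
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def char_f1(prediction: str, gold: str) -> float:
    """Character-overlap F1, ignoring whitespace (an assumed granularity)."""
    pred_chars = [c for c in prediction if not c.isspace()]
    gold_chars = [c for c in gold if not c.isspace()]
    if not pred_chars or not gold_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == gold_chars)
    overlap = sum((Counter(pred_chars) & Counter(gold_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, a prediction that differs from a two-character gold answer by one character scores 0.0 EM and 0.5 F1, which shows how EM is stricter than F1 on short Korean answers.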

Key takeaway

NLP engineers building Korean spoken QA systems should prioritize ASR accuracy and robustness, because information loss at the ASR stage is the main performance bottleneck. LLM prompting changes such as ASR-aware disclaimers are unlikely to yield significant gains. Instead, explore direct audio language models that bypass transcription errors, especially given the high semantic impact of single-character ASR mistakes in Korean.
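For readers unfamiliar with the idea, an ASR-aware disclaimer prompt can be sketched as follows. The summary does not give the study's prompt wording or layout, so the disclaimer text, the `build_prompt` helper, and the choice to treat the transcript as the passage are all hypothetical.

```python
# Hypothetical wording; the study's actual disclaimer is not given in the summary.
ASR_DISCLAIMER = (
    "The passage below was produced by automatic speech recognition "
    "and may contain transcription errors."
)


def build_prompt(transcript: str, question: str, use_disclaimer: bool) -> str:
    """Assemble a QA prompt, optionally prefixed with the ASR disclaimer."""
    parts = []
    if use_disclaimer:
        parts.append(ASR_DISCLAIMER)
    parts.append(f"Passage: {transcript}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)
```

Running the same noisy transcripts through both variants (`use_disclaimer=True` and `False`) and comparing scores is one way to check whether the disclaimer helps on a given system.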

Key insights

ASR-stage information loss, especially single-character errors in Korean, is the primary bottleneck for ASR-LLM cascade performance.

Principles

The relative impact of ASR errors on QA performance is consistent across LLMs of different sizes and capabilities.
Minimal ASR transcription differences can cause complete semantic failure.
Direct audio input can mitigate ASR-induced information loss.
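The second principle can be illustrated with a small check for whether the gold answer survives transcription. This is a sketch under stated assumptions: the helper names are hypothetical, the substring test is a simplification of whatever matching the study used, and the Korean sentence pair is an invented example, not data from the paper.

```python
def squash(text: str) -> str:
    """Drop all whitespace so spacing differences do not affect the check."""
    return "".join(text.split())


def answer_present(transcript: str, gold_answer: str) -> bool:
    """True if the gold answer still appears verbatim in the transcript."""
    return squash(gold_answer) in squash(transcript)


def absent_rate(cases) -> float:
    """Fraction of (transcript, gold_answer) pairs whose answer is missing."""
    cases = list(cases)
    if not cases:
        raise ValueError("cases must not be empty")
    missing = sum(not answer_present(t, g) for t, g in cases)
    return missing / len(cases)


# Invented example: one substituted syllable (울 -> 운) removes the answer.
reference = "대한민국의 수도는 서울이다"
hypothesis = "대한민국의 수도는 서운이다"
print(answer_present(reference, "서울"))   # True
print(answer_present(hypothesis, "서울"))  # False
```

Because a Korean answer is often only two or three syllable blocks long, a single substituted block is enough to make the answer unrecoverable from the text alone.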

Method

The study synthesized Korean speech, added noise at seven SNR levels (producing CERs from 0.03 to 0.50), transcribed the audio with Whisper-large-v3, and fed the transcripts to the LLMs for SQA evaluation. It also compared the ASR-LLM cascade with a direct audio language model and tested an ASR-aware disclaimer prompt.
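The character error rate used to describe the noise conditions is the character-level edit distance between the ASR output and the reference, divided by the reference length. The sketch below uses the standard Levenshtein recurrence. Removing whitespace and applying NFC normalization (so each Hangul syllable block counts as one character) are assumptions, since the summary does not describe the study's preprocessing, and the sentence pair is an invented example.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def prepare(text: str) -> str:
    """NFC-normalize and strip whitespace (assumed preprocessing)."""
    return "".join(unicodedata.normalize("NFC", text).split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length."""
    ref, hyp = prepare(reference), prepare(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return levenshtein(ref, hyp) / len(ref)


# One substituted syllable in a 12-character reference gives a CER of 1/12.
print(cer("대한민국의 수도는 서울이다", "대한민국의 수도는 서운이다"))
```

Note that a CER this low (about 0.08) can still coincide with a complete loss of the answer, which is why CER alone understates the semantic damage of individual errors.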

In practice

Prioritize ASR robustness over LLM prompting for SQA.
Consider direct audio language models for noisy speech inputs.
Be aware of single-character ASR error sensitivity in Korean.

Topics

Korean Spoken QA
ASR-LLM Cascades
Error Propagation
Large Audio Language Models
Character Error Rate
Qwen2.5-Omni-7B-Instruct

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.