Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
Summary
A new benchmark and dataset, released through AU-Harness, evaluates Automatic Speech Recognition (ASR) models on code-switched speech, a common communication pattern for over half the world's population. It covers four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in HR and IT Service Management scenarios, using synthetic audio generated by LLMs and TTS and validated by linguists. Seven ASR systems, including Large Audio Language Models (LALMs) and frontier ASRs, are assessed with Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro are the top performers and show surprisingly small performance penalties for code-switching relative to monolingual baselines. Errors tend to concentrate on the English portions of utterances. The number of language switches influences whether errors occur, while the Code-Mixing Index (CMI) influences how large they are.
Key takeaway
If you are an AI/NLP engineer deploying voice agents for a bilingual customer base, benchmark ASR systems against your specific code-switched language pairs. Top models such as ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro perform robustly, but the best choice varies by language pair. Include semantic metrics such as SWER and AER alongside WER to confirm that meaning is preserved for downstream tasks. Language auto-detection may also underperform an explicit language configuration, so test both.
Key insights
Frontier ASR models handle code-switched speech with minimal performance degradation, but error patterns vary by language pair and model.
Principles
- Code-switching cost varies significantly across language pairs and ASR models.
- Transcription errors in code-switched speech concentrate on embedded English segments.
- Number of language switches impacts error likelihood; Code-Mixing Index (CMI) influences error magnitude.
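To make the last principle concrete, here is a minimal Python sketch that computes both quantities from token-level language tags. The source does not say which CMI formulation the benchmark uses; this sketch assumes the commonly cited utterance-level definition, CMI = 100 * (1 - max_lang / (n - u)), where n is the token count, u is the number of language-neutral tokens, and max_lang is the token count of the dominant language. The function names, the `"other"` tag, and the example utterance are illustrative and not taken from AU-Harness.

```python
from collections import Counter


def switch_count(tags, neutral="other"):
    """Count language changes between consecutive non-neutral tokens."""
    langs = [t for t in tags if t != neutral]
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)


def code_mixing_index(tags, neutral="other"):
    """Utterance-level CMI (assumed definition): 100 * (1 - max_lang / (n - u))."""
    n = len(tags)
    u = sum(1 for t in tags if t == neutral)
    if n == u:
        # No language-tagged tokens at all, so nothing is mixed.
        return 0.0
    counts = Counter(t for t in tags if t != neutral)
    return 100.0 * (1.0 - max(counts.values()) / (n - u))


# Hypothetical tags for: "Necesito resetear mi password , por favor"
tags = ["es", "es", "es", "en", "other", "es", "es"]
switches = switch_count(tags)   # es -> en, then en -> es
cmi = code_mixing_index(tags)   # 100 * (1 - 5/6)
```

A monolingual utterance scores 0 on both measures, so the two numbers can be used to bucket test utterances before comparing error rates across buckets.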
Method
A benchmark pipeline generates synthetic code-switched audio using LLMs and TTS, then evaluates ASR models with WER, SWER (Gemma-4-31B judge), and AER (LLM-based Q&A) metrics.
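Of the three metrics, WER is the only one that needs no LLM judge, and it can be sketched in plain Python as a word-level edit distance divided by the reference length. The normalization step (lowercasing and stripping punctuation) is an assumption; the benchmark's own text normalization is not described here and may differ. SWER and AER depend on judge prompts that the summary does not give, so they are not sketched.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace (assumed)."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical reference and ASR output, for illustration only.
wer = word_error_rate("Necesito resetear mi password, por favor",
                      "necesito resetear mi pasword por favor")
```

Here one of six reference words is substituted, so `wer` is 1/6. A misspelled but still understandable word counts as a full error under WER, which is one reason the benchmark pairs it with semantic metrics.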
In practice
- Benchmark ASR systems using AU-Harness for code-switched scenarios.
- Evaluate models with semantic metrics (SWER, AER) beyond just WER.
- Analyze error patterns focusing on embedded language segments.
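The last item, checking whether errors cluster in the embedded language, can be sketched as follows. The example aligns reference and hypothesis tokens with Python's standard `difflib.SequenceMatcher` and attributes each substituted or deleted reference token to its language tag. This is an assumed approach, not the benchmark's own analysis: `SequenceMatcher` does not guarantee a minimum-edit-distance alignment, and inserted hypothesis words are ignored because they have no reference tag. The function name and sample data are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def errors_by_language(ref_tokens, ref_tags, hyp_tokens):
    """Return {language: (missed_tokens, total_tokens)} over reference tokens."""
    if len(ref_tokens) != len(ref_tags):
        raise ValueError("each reference token needs exactly one language tag")
    totals = Counter(ref_tags)
    missed = Counter()
    matcher = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" both mean reference tokens i1..i2 were not
        # reproduced in the hypothesis.
        if op in ("replace", "delete"):
            missed.update(ref_tags[i1:i2])
    return {lang: (missed[lang], totals[lang]) for lang in totals}


# Hypothetical Spanish-English utterance with one embedded English word.
ref_tokens = ["necesito", "resetear", "mi", "password", "por", "favor"]
ref_tags = ["es", "es", "es", "en", "es", "es"]
hyp_tokens = ["necesito", "resetear", "mi", "pasword", "por", "favor"]

report = errors_by_language(ref_tokens, ref_tags, hyp_tokens)
```

Summing these per-language counts over a test set and dividing missed by total gives a per-language error share that can be compared against the benchmark's finding about embedded English segments.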
Topics
- Automatic Speech Recognition
- Code-Switched Speech
- ASR Benchmarking
- Voice Agents
- Semantic Error Rate
- Large Audio Language Models
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, NLP Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.