Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

2026-06-09 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

A new benchmark and dataset, released through AU-Harness, evaluates Automatic Speech Recognition (ASR) models on code-switched speech, a common communication pattern for over half the world's population. Focusing on four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in HR and IT Service Management scenarios, the benchmark uses synthetic audio generated by LLMs and TTS, validated by linguists. It assesses seven ASR systems, including Large Audio Language Models (LALMs) and frontier ASRs, using Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). Key findings indicate that ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro are top performers, exhibiting surprisingly small performance penalties for code-switching relative to monolingual baselines. Errors tend to concentrate on the English portions of utterances, and the number of language switches and Code-Mixing Index (CMI) influence error occurrence and magnitude.

Key takeaway

AI and NLP engineers deploying voice agents for bilingual customers should benchmark ASR systems on the specific code-switched language pairs they serve. Top models such as ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro perform robustly, but the best choice varies by language pair. Evaluate with semantic metrics such as SWER and AER, not just WER, to confirm that meaning is preserved for downstream tasks. Note that automatic language detection may underperform an explicit language configuration.

Key insights

Frontier ASR models handle code-switched speech with minimal performance degradation, but error patterns vary by language pair and model.

Principles

Code-switching cost varies significantly across language pairs and ASR models.
Transcription errors in code-switched speech concentrate on embedded English segments.
The number of language switches affects how likely errors are; the Code-Mixing Index (CMI) affects how large they are.
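
To make the two quantities in the last principle concrete, here is a minimal sketch that computes a switch count and a CMI score from per-token language tags. It assumes the commonly cited Das and Gambäck formulation, CMI = 100 × (1 − max(w_i) / (n − u)), where w_i is the token count for language i, n is the total token count, and u is the number of language-neutral tokens. The summary does not say which CMI variant the benchmark uses, and the function names and the "other" tag are hypothetical.

```python
from collections import Counter


def code_mixing_index(tags, neutral="other"):
    """CMI under the assumed Das and Gambäck formulation (0 = monolingual)."""
    # Count only tokens that carry a language tag (this is n - u).
    counts = Counter(t for t in tags if t != neutral)
    lang_tokens = sum(counts.values())
    if lang_tokens == 0:
        return 0.0
    # Share of tokens that are NOT in the dominant language, scaled to 0..100.
    return 100.0 * (1 - max(counts.values()) / lang_tokens)


def count_switches(tags, neutral="other"):
    """Number of points where the language changes, skipping neutral tokens."""
    langs = [t for t in tags if t != neutral]
    return sum(1 for a, b in zip(langs, langs[1:]) if a != b)


# Illustrative tags for "necesito reset my password por favor" (made-up example).
example_tags = ["es", "en", "en", "en", "es", "es"]
example_cmi = code_mixing_index(example_tags)
example_switches = count_switches(example_tags)
```

Keeping the two measures separate makes it possible to examine error occurrence against switch count and error size against CMI independently.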

Method

A benchmark pipeline generates synthetic code-switched audio using LLMs and TTS, then evaluates ASR models with WER, SWER (Gemma-4-31B judge), and AER (LLM-based Q&A) metrics.
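
Of the three metrics, WER is the only one that needs no LLM, so it is easy to sketch. The example below computes word-level edit distance divided by reference length. The normalization (lowercasing and whitespace splitting) is an assumption; the benchmark's actual text normalization is not described in this summary, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up code-switched utterance with one misrecognized English word.
example_wer = word_error_rate(
    "necesito reset my password por favor",
    "necesito recent my password por favor",
)
```

A single substituted word in a six-word reference gives a WER of 1/6, even though the request may still be understandable, which is the gap that semantic metrics such as SWER and AER are meant to cover.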

In practice

Benchmark ASR systems using AU-Harness for code-switched scenarios.
Evaluate models with semantic metrics (SWER, AER) beyond just WER.
Analyze error patterns focusing on embedded language segments.
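
One way to approach the last item is to tag each reference word with its language and attribute transcription errors to those tags. The sketch below is an illustration, not the benchmark's method: it aligns words with Python's `difflib.SequenceMatcher`, which is not guaranteed to produce a minimum-edit-distance alignment, and it ignores inserted words because they have no reference token to attribute. The function name and tag values are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def errors_by_language(ref_tokens, ref_tags, hyp_tokens):
    """Fraction of reference words per language that were substituted or deleted."""
    if len(ref_tokens) != len(ref_tags):
        raise ValueError("each reference token needs exactly one language tag")

    totals = Counter(ref_tags)
    errors = Counter()
    matcher = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'replace' and 'delete' both mean reference words i1..i2 were lost.
        if op in ("replace", "delete"):
            errors.update(ref_tags[i1:i2])
    return {lang: errors[lang] / totals[lang] for lang in totals}


# Made-up example: the error falls on an embedded English word.
example_rates = errors_by_language(
    ["necesito", "reset", "my", "password", "por", "favor"],
    ["es", "en", "en", "en", "es", "es"],
    ["necesito", "recent", "my", "password", "por", "favor"],
)
```

Aggregating these per-language rates over a test set would show whether errors cluster on the embedded English segments for a given model and language pair.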

Topics

Automatic Speech Recognition
Code-Switched Speech
ASR Benchmarking
Voice Agents
Semantic Error Rate
Large Audio Language Models

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, NLP Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.