Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Source: Hugging Face - Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, long

Summary

A new benchmark and dataset, released through AU-Harness, evaluates Automatic Speech Recognition (ASR) models on code-switched speech, in which speakers alternate between languages within a conversation or utterance. This is a common communication pattern for over half the world's population. The benchmark covers four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in HR and IT Service Management scenarios, using synthetic audio generated with LLMs and text-to-speech (TTS) and validated by linguists. It assesses seven ASR systems, including Large Audio Language Models (LALMs) and frontier ASRs, on three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro are the top performers, with surprisingly small performance penalties for code-switching relative to monolingual baselines. Errors tend to concentrate on the English portions of utterances, and both the number of language switches and the Code-Mixing Index (CMI) influence how often errors occur and how large they are.
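
To make the two code-switching measures in the summary concrete, here is a minimal sketch of how a switch count and a Code-Mixing Index could be computed from per-token language tags. It assumes the commonly used utterance-level CMI definition from Gambäck and Das; the summary does not say which variant the benchmark uses. The function names, the tag labels, and the `"other"` tag for language-independent tokens are illustrative assumptions.

```python
from collections import Counter

# Assumption: each token carries a language tag such as "es" or "en";
# language-independent tokens (names, numbers) are tagged "other".
NEUTRAL_TAG = "other"


def code_mixing_index(lang_tags: list[str]) -> float:
    """Utterance-level CMI (Gambäck & Das style), in the range 0 to 100.

    0 means monolingual; higher values mean the non-dominant
    language(s) make up a larger share of the tagged tokens.
    """
    counts = Counter(tag for tag in lang_tags if tag != NEUTRAL_TAG)
    tagged_total = sum(counts.values())
    if tagged_total == 0:
        # Only language-independent tokens: no mixing by definition.
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (tagged_total - dominant) / tagged_total


def switch_count(lang_tags: list[str]) -> int:
    """Number of points where the language changes between adjacent tagged tokens."""
    tagged = [tag for tag in lang_tags if tag != NEUTRAL_TAG]
    return sum(1 for prev, curr in zip(tagged, tagged[1:]) if prev != curr)


# "necesito resetear mi password hoy": one English word in a Spanish utterance.
tags = ["es", "es", "es", "en", "es"]
print(code_mixing_index(tags), switch_count(tags))
```

The two measures capture different things: CMI reflects how much of the utterance is in the non-dominant language, while the switch count reflects how often the speaker crosses a language boundary.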

Key takeaway

If you are an AI/NLP engineer deploying voice agents for a bilingual customer base, benchmark ASR systems against your specific code-switched language pairs. Top models such as ElevenLabs Scribe V2, Gemini 3 Flash, and Assembly AI Universal 3-Pro perform robustly, but the best choice varies by language pair. Include semantic metrics such as SWER and AER in your evaluation, not just WER, to confirm that meaning is preserved for downstream tasks. Be aware that automatic language detection may not match the performance of an explicit language configuration.

Key insights

Frontier ASR models handle code-switched speech with minimal performance degradation, but error patterns vary by language pair and model.

Method

A benchmark pipeline generates synthetic code-switched audio using LLMs and TTS, then evaluates ASR models with WER, SWER (Gemma-4-31B judge), and AER (LLM-based Q&A) metrics.
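
Of the three metrics, WER is the only one that can be computed without a model in the loop. The sketch below shows the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The text normalization used here (lowercasing and whitespace splitting) is an assumption, not the benchmark's documented preprocessing, and `word_error_rate` is an illustrative name. SWER and AER depend on an LLM judge and are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance with one misrecognized English word.
reference = "necesito resetear mi password hoy"
hypothesis = "necesito resetear mi pasword hoy"
print(word_error_rate(reference, hypothesis))
```

A single misrecognized word out of five gives the same WER whether it is a filler or the key term of the request, which is why the benchmark pairs WER with the semantic metrics.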

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, NLP Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.