Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs
Summary
This research investigates whether code-switching Automatic Speech Recognition (CS-ASR) capabilities generalize from seen to unseen language pairs, motivated by the scarcity of multilingual CS speech resources. Using Whisper-medium as the backbone, the study fine-tuned models on English-centric pairs (Korean-English, Japanese-English, German-English) and evaluated how well they transfer to unseen non-English pairs such as Korean-Japanese and Korean-German. New evaluation datasets were constructed for these unseen pairs, comprising 450 Korean-Japanese and 387 Korean-German utterances. Experiments explored model merging (Task Arithmetic, TIES, DARE) and domain generalization (Fish, Fishr, GGA) techniques. Fine-tuning and these generalization methods yield only modest improvements: the average Mixed Error Rate (MER) on unseen pairs is 0.32, still well above the sub-0.2 MER on seen pairs. Layer-wise analysis showed that CS adaptation occurs primarily in the higher encoder and decoder layers.
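To make the MER figures above concrete, here is a minimal sketch of how a mixed error rate can be computed. The tokenization rule is an assumption, not something this summary specifies: Hangul, kana, and CJK ideographs are scored per character, and everything else per whitespace-delimited word. The function names are illustrative, and punctuation normalization is left out.

```python
import re

# Assumption: these scripts are scored per character; all other text per word.
_CHAR_SCRIPT = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3]")


def mixed_tokens(text):
    """Split text into character tokens (Hangul/kana/CJK) and word tokens (rest)."""
    tokens, word = [], []
    for ch in text:
        if _CHAR_SCRIPT.match(ch):
            if word:  # flush any pending word before the character token
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append("".join(word))
                word = []
        else:
            word.append(ch.lower())
    if word:
        tokens.append("".join(word))
    return tokens


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(refs, hyps):
    """Total edit errors divided by total reference tokens over a corpus."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = mixed_tokens(ref), mixed_tokens(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return errors / total if total else 0.0
```

Under this convention, a Korean-Japanese utterance with one embedded English word counts each Korean or Japanese character as one token and the English word as one token, so dropping a single character from a six-token reference gives an MER of 1/6.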
Key takeaway
If you develop multilingual ASR systems, recognize that current model merging and domain generalization techniques transfer code-switching ability to unseen language pairs only to a limited degree. Prioritize CS-ASR architectures and adaptation strategies designed specifically for cross-pair generalization rather than relying on existing general-purpose methods. Consider contributing to or using more diverse, higher-quality multilingual code-switching speech datasets to support practical deployment.
Key insights
Code-switching ASR generalization to unseen language pairs remains limited despite model merging and domain generalization efforts.
Principles
- Code-switching ASR capabilities transfer modestly across language pairs.
- CS adaptation primarily modifies higher-level semantic and linguistic representations.
- Naive application of general-purpose domain generalization methods is insufficient for CS-ASR.
Method
The study fine-tuned Whisper-medium on seen bilingual CS datasets, then applied model merging (Task Arithmetic, TIES, DARE) and domain generalization (Fish, Fishr, GGA) to evaluate performance on newly constructed unseen CS language pair datasets.
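As an illustration of the simplest merging method named above, the following sketch applies Task Arithmetic to toy weights. It assumes each model is a dict mapping a parameter name to a flat list of floats; a real Whisper-medium checkpoint would hold tensors instead, and the `scale` value is a hypothetical hyperparameter, not one reported in this summary.

```python
def task_vector(base, finetuned):
    """Task vector: fine-tuned weights minus base weights, per parameter."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}


def task_arithmetic_merge(base, finetuned_models, scale=1.0):
    """Add the scaled sum of all task vectors back onto the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for model in finetuned_models:
        for name, delta in task_vector(base, model).items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


# Toy example: two hypothetical pair-specific fine-tunes of one base model.
base = {"w": [1.0, 2.0]}
ko_en = {"w": [1.5, 2.0]}
ja_en = {"w": [1.0, 1.0]}
merged = task_arithmetic_merge(base, [ko_en, ja_en], scale=0.5)
```

The merged model keeps the base weights wherever neither fine-tune changed them and moves in the combined direction of the pair-specific updates elsewhere. TIES and DARE refine this idea by pruning or rescaling the task vectors before they are combined.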
In practice
- Construct small-scale evaluation datasets for under-resourced language pairs.
- Consider TIES-Merging for combining language-pair-specific CS-ASR models.
- Focus CS-ASR adaptation on the higher encoder and decoder layers, where the layer-wise analysis found most of the change.
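The TIES-Merging suggestion in the list above can be sketched as three steps: trim each task vector to its largest-magnitude entries, elect a sign per parameter, and average only the entries that agree with that sign. This is a simplified sketch on flat lists of floats. The `density` and `scale` values are assumptions chosen for illustration, not settings taken from the study.

```python
def trim(vec, density):
    """Keep roughly the top `density` fraction of entries by magnitude, zero the rest.

    Entries tied with the threshold magnitude are all kept.
    """
    k = max(1, round(len(vec) * density))
    threshold = sorted((abs(v) for v in vec), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in vec]


def ties_merge(base, finetuned, density=0.2, scale=1.0):
    """Simplified TIES merge of several fine-tuned weight vectors onto a base."""
    # Step 1: build and trim one task vector per fine-tuned model.
    deltas = [trim([f - b for f, b in zip(model, base)], density)
              for model in finetuned]
    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Step 2: elect the sign carrying the larger total magnitude.
        total = sum(column)
        sign = (total > 0) - (total < 0)
        # Step 3: average only the entries that agree with the elected sign.
        agreeing = [v for v in column if v * sign > 0]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + scale * update)
    return merged


# Toy example with two hypothetical pair-specific models.
base = [0.0, 0.0, 0.0, 0.0]
model_a = [1.0, 0.1, -0.5, 0.0]
model_b = [-0.2, 0.0, -1.0, 0.05]
result = ties_merge(base, [model_a, model_b], density=0.5)
```

In the toy example the first parameter has conflicting updates (1.0 and -0.2); the positive sign wins, so only 1.0 contributes, whereas plain averaging would let the two updates partially cancel.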
Topics
- Code-Switching ASR
- Multilingual ASR
- Model Merging
- Domain Generalization
- Whisper-medium
- Speech Datasets
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.