Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR
Summary
A study of cross-lingual encoder transfer in streaming Automatic Speech Recognition (ASR) finds that the advantage of multilingual (ML) encoder initialization is governed by the amount of target-language data, not by streaming latency. The researchers ran a controlled sweep with a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, covering data scales from 100 h to 2500 h, three streaming tiers, and offline decoding. On FLEURS at 160 ms, the word error rate (WER) gap between English-only (EN) and ML initialization shrinks from +4.21 percentage points (pp) at 100 h to +0.20 pp at 2500 h, with each data doubling roughly halving the remaining advantage. The gap stayed stable across streaming tiers from 100 h to 1000 h and was negligible by 2500 h. Separately, 4-bit weight-only encoder quantization at the 560 ms streaming tier reduced the encoder footprint by approximately 3x while raising average FLEURS WER by about 0.5 pp.
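The "roughly halving per doubling" claim can be sanity-checked from the two reported endpoints alone. The sketch below assumes the gap follows a simple power law in training hours, which is an illustrative assumption and not a model taken from the study; the function names are hypothetical, and no intermediate data points are implied.

```python
import math

# Endpoints reported in the summary (FLEURS, 160 ms tier):
# EN-minus-ML WER gap in percentage points at two data scales.
HOURS_LOW, GAP_LOW = 100.0, 4.21
HOURS_HIGH, GAP_HIGH = 2500.0, 0.20


def fitted_exponent(h0, g0, h1, g1):
    """Exponent alpha of an ASSUMED power law gap(h) = g0 * (h / h0) ** -alpha,
    chosen so the curve passes through both reported endpoints."""
    return math.log(g0 / g1) / math.log(h1 / h0)


def gap_at(hours, alpha, h0=HOURS_LOW, g0=GAP_LOW):
    """Gap predicted by the assumed power law at a given number of hours."""
    return g0 * (hours / h0) ** -alpha


alpha = fitted_exponent(HOURS_LOW, GAP_LOW, HOURS_HIGH, GAP_HIGH)
per_doubling = 2.0 ** -alpha          # fraction of the gap left after one doubling
doublings = math.log2(HOURS_HIGH / HOURS_LOW)

print(f"doublings from 100 h to 2500 h: {doublings:.2f}")
print(f"fitted exponent alpha:          {alpha:.3f}")
print(f"gap remaining per doubling:     {per_doubling:.3f}")
# Exact halving per doubling corresponds to alpha = 1.
print(f"gap at 2500 h if alpha = 1:     {gap_at(HOURS_HIGH, 1.0):.3f} pp")
```

Under this assumption the fitted exponent comes out close to 1, so each doubling leaves a little over half of the gap, and exact halving would predict a gap at 2500 h slightly below the reported +0.20 pp. Both are consistent with the summary's "roughly halving" wording.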
Key takeaway
For machine learning engineers adapting streaming ASR models to new languages: prioritize multilingual encoder initialization when target-language data is scarce (e.g., 100 h). As the data grows toward 2500 h, the choice between multilingual and English-only initialization matters little, because the WER gap shrinks to about +0.20 pp. Since the gap is stable across streaming tiers, the latency tier can be chosen independently of initialization. Quantization is a separate lever: 4-bit weight-only quantization reduced the encoder footprint by about 3x for an average FLEURS WER increase of roughly 0.5 pp at the 560 ms tier.
Key insights
Multilingual initialization benefits streaming ASR primarily in low-data regimes, not due to latency constraints.
Principles
- The advantage of multilingual initialization decays as target-language data increases.
- Streaming latency does not significantly amplify multilingual encoder benefits.
- Quantization decisions can be made independently of initialization choice.
Method
Controlled sweep of a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, varying data scale (100-2500 h), decoding mode (three streaming tiers plus offline), and test sets.
In practice
- Use multilingual initialization for low-data ASR language adaptation.
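- Treat English-only and multilingual initialization as roughly interchangeable for high-data ASR language adaptation (around 2500 h), where the gap is about +0.20 pp.
- Apply 4-bit weight-only quantization to reduce encoder footprint by ~3x at a cost of about 0.5 pp average FLEURS WER (measured at the 560 ms tier).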
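As a back-of-envelope aid for the quantization point, the sketch below estimates the footprint ratio of weight-only quantization. Every parameter here is a hypothetical assumption for illustration (16-bit baseline weights, group-wise scales, a configurable fraction of weights left unquantized); none of these values come from the study, which reports only the approximate 3x reduction.

```python
def compression_ratio(quantized_fraction, baseline_bits=16.0, weight_bits=4.0,
                      group_size=128, scale_bits=16.0):
    """Baseline footprint divided by quantized footprint.

    All defaults are illustrative assumptions, not values from the study:
    - baseline_bits: bits per weight before quantization
    - weight_bits:   bits per quantized weight
    - group_size:    weights sharing one scale factor
    - scale_bits:    bits used to store each scale factor
    - quantized_fraction: share of parameters that are actually quantized
    """
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be in [0, 1]")
    # Each quantized weight also pays for its share of the group scale.
    bits_per_quantized = weight_bits + scale_bits / group_size
    average_bits = (quantized_fraction * bits_per_quantized
                    + (1.0 - quantized_fraction) * baseline_bits)
    return baseline_bits / average_bits


# Everything quantized: the scale overhead alone keeps the ratio below 4x.
print(f"{compression_ratio(1.0):.2f}x")
# Hypothetical case where 10% of parameters stay at baseline precision.
print(f"{compression_ratio(0.9):.2f}x")
```

The point of the sketch is only that a realized reduction below the nominal 16/4 = 4x is expected once scale metadata and any unquantized parameters are counted; which of these factors applied in the study is not stated in this summary.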
Topics
- Streaming ASR
- Cross-lingual Transfer
- Encoder Initialization
- Data Scale
- Word Error Rate
- Model Quantization
- FastConformer Transducer
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.