Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

2026-06-23 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study of cross-lingual encoder transfer in streaming automatic speech recognition (ASR) finds that the advantage of multilingual (ML) initialization is limited by data, not by latency. The researchers ran a controlled sweep with a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, covering data scales from 100 h to 2500 h, three streaming tiers, and offline decoding. On FLEURS at 160 ms, the word error rate (WER) gap between English-only (EN) and ML initialization shrinks from +4.21 percentage points (pp) at 100 h to +0.20 pp at 2500 h, with each doubling of data roughly halving the remaining advantage. The gap was stable across streaming tiers from 100 h to 1000 h and negligible by 2500 h. Separately, 4-bit weight-only encoder quantization at the 560 ms streaming tier cut the encoder footprint by approximately 3x at a cost of about 0.5 pp in average FLEURS WER.
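The "roughly halving per doubling" claim can be checked against the two endpoint figures quoted above. The sketch below is only arithmetic on those two numbers; the constant-decay (power-law) shape and the `estimated_gap` helper are assumptions made for illustration, not something the study is known to have fitted.

```python
import math

# Endpoint figures quoted in the summary (FLEURS, 160 ms tier).
HOURS_LOW, GAP_LOW = 100.0, 4.21      # EN vs ML WER gap in pp at 100 h
HOURS_HIGH, GAP_HIGH = 2500.0, 0.20   # EN vs ML WER gap in pp at 2500 h

# Number of data doublings between the two endpoints.
doublings = math.log2(HOURS_HIGH / HOURS_LOW)

# Fraction of the gap that survives each doubling, ASSUMING the decay
# is constant per doubling (a power law in training hours).
retained_per_doubling = (GAP_HIGH / GAP_LOW) ** (1.0 / doublings)


def estimated_gap(hours: float) -> float:
    """Hypothetical helper: interpolate the gap in pp under the assumed power law."""
    return GAP_LOW * retained_per_doubling ** math.log2(hours / HOURS_LOW)


print(f"doublings between endpoints: {doublings:.2f}")
print(f"gap retained per doubling:   {retained_per_doubling:.2f}")
```

Between 100 h and 2500 h there are about 4.6 doublings, and the retained fraction per doubling comes out near one half, which is consistent with the summary's wording. Values returned by `estimated_gap` for intermediate scales are interpolations under the stated assumption, not reported results.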

Key takeaway

For machine learning engineers adapting streaming ASR models to new languages: prioritize multilingual encoder initialization when target-language data is scarce (e.g., 100 h). As target-language data approaches 2500 h, the choice between multilingual and English-only initialization matters little, because the WER gap shrinks to about 0.20 pp. Streaming latency and quantization can be decided independently of the initialization choice: the gap was stable across streaming tiers, and 4-bit weight-only quantization reduced the encoder footprint by approximately 3x for a WER increase of about 0.5 pp.

Key insights

Multilingual initialization benefits streaming ASR primarily in low-data regimes; the benefit is not driven by latency constraints.

Principles

Multilingual initialization advantage decays with increasing target-language data.
Streaming latency does not significantly amplify multilingual encoder benefits.
Quantization decisions can be made independently of initialization choice.

Method

Controlled sweep of a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, varying data scale (100 h to 2500 h), decoding mode (three streaming tiers plus offline), and test set.
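The gaps in this summary are differences between two WER values, expressed in percentage points. A minimal sketch of that computation follows, assuming standard word-level edit distance over whitespace-split tokens. The function names and the toy transcripts are hypothetical, and the study's own text normalization and scoring tools are not described here.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level WER in percent: total errors over total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words


# Toy transcripts (made up) standing in for two differently initialized models.
references = ["the cat sat on the mat", "good morning everyone"]
en_init_out = ["the cat sat on mat", "good morning everyone"]
ml_init_out = ["the cat sat on the mat", "good morning everyone"]

wer_en = corpus_wer(list(zip(references, en_init_out)))
wer_ml = corpus_wer(list(zip(references, ml_init_out)))
gap_pp = wer_en - wer_ml  # positive means English-only initialization is worse
```

Reporting the gap in pp rather than as a relative change keeps it comparable across data scales, where the absolute WER level itself shifts.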

In practice

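Use multilingual initialization when adapting ASR to a language with little data (around 100 h).
Treat English-only initialization as a viable alternative at high data scales (around 2500 h), where the gap is about 0.20 pp.
Apply 4-bit weight-only encoder quantization to reduce the encoder footprint by ~3x, at a cost of about 0.5 pp average FLEURS WER (measured at the 560 ms tier).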
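To make "4-bit weight-only quantization" concrete, here is a minimal pure-Python sketch of symmetric group-wise weight quantization. The group size, the 16-bit scale per group, and the 16-bit baseline are all assumptions chosen for illustration; the summary does not say which scheme or baseline precision the study used, and this is not its implementation.

```python
import random

GROUP_SIZE = 64      # assumed number of weights sharing one scale
SCALE_BITS = 16      # assumed storage for each per-group scale
BASELINE_BITS = 16   # assumed precision of the unquantized weights


def quantize_group(weights: list[float]) -> tuple[list[int], float]:
    """Map one group of weights to signed 4-bit codes in [-8, 7] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: list[int], scale: float) -> list[float]:
    """Recover approximate weights; activations are never quantized (weight-only)."""
    return [c * scale for c in codes]


def compression_ratio(group_size: int = GROUP_SIZE) -> float:
    """Baseline bits per weight over quantized bits per weight, scales included."""
    bits_per_weight = 4 + SCALE_BITS / group_size
    return BASELINE_BITS / bits_per_weight


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]  # synthetic weights
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"max reconstruction error: {max_error:.5f} (bound: {scale / 2:.5f})")
print(f"idealized compression ratio: {compression_ratio():.2f}x")
```

Under these assumptions the idealized ratio is somewhat above the ~3x reported, so the sketch does not reproduce the study's figure. A measured footprint also depends on the baseline precision, on scale and zero-point overhead, and on any layers left unquantized.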

Topics

Streaming ASR
Cross-lingual Transfer
Encoder Initialization
Data Scale
Word Error Rate
Model Quantization
FastConformer Transducer

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.