UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition
Summary
UMA-Split is a non-autoregressive (NAR) speech recognition model designed for both English and Mandarin. The original Unimodal Aggregation (UMA) method works well for Mandarin but struggles with English: fine-grained BPE tokens often span fewer than three acoustic frames, which is too short for unimodal weights to form. UMA-Split addresses this with a "split module" that lets each UMA-aggregated frame map to two text tokens before the Connectionist Temporal Classification (CTC) loss is computed. On LibriSpeech (English), the 149M-parameter model achieved 2.22%/4.93% Word Error Rate (WER) on test-clean/test-other, matching hybrid CTC/attention autoregressive models while offering a 10x inference speedup. On AISHELL-1 (Mandarin), it achieved a 4.43% Character Error Rate (CER), outperforming other advanced NAR models.
Key takeaway
For machine learning engineers building non-autoregressive ASR systems that must handle both English and Mandarin, UMA-Split offers gains in both accuracy and speed. Consider adding its split module to overcome the limits of unimodal aggregation under fine-grained tokenization, particularly English BPE. In the reported experiments, this approach matches the accuracy of hybrid CTC/attention autoregressive models with a 10x inference speedup, which suits real-time speech processing where both speed and precision are critical.
Key insights
UMA-Split extends unimodal aggregation to English non-autoregressive ASR by allowing aggregated frames to map to multiple fine-grained tokens.
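To illustrate why extra output slots help, here is a minimal sketch of the standard CTC length constraint and collapse rule. It shows one plausible reading of the benefit, not the paper's stated analysis: if aggregation yields fewer frames than there are tokens, CTC cannot emit the target, whereas two slots per aggregated frame can. The token strings, the frame count, and the names `BLANK`, `ctc_collapse`, and `min_ctc_length` are assumptions made for this example.

```python
BLANK = "<b>"  # CTC blank symbol (name chosen for this sketch)


def ctc_collapse(path):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


def min_ctc_length(target):
    """Shortest CTC input that can emit `target` (a blank must separate repeats)."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


# Hypothetical BPE tokens and a hypothetical count of aggregated frames.
target = ["_spe", "ech", "_rec", "og", "ni", "tion"]
num_aggregated = 4

feasible_without_split = num_aggregated >= min_ctc_length(target)
feasible_with_split = 2 * num_aggregated >= min_ctc_length(target)

# One possible alignment: two output slots per aggregated frame.
# A slot can stay blank when its frame covers only one token.
slots = [("_spe", "ech"), ("_rec", BLANK), ("og", "ni"), ("tion", BLANK)]
path = [sym for pair in slots for sym in pair]
decoded = ctc_collapse(path)
```

In this toy case, four aggregated frames cannot cover six tokens on their own, while the eight slots produced by splitting can, and the second slot of a frame simply carries a blank when no extra token is needed.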
Principles
- NAR ASR can match autoregressive (AR) accuracy.
- Explicit frame aggregation improves token representation.
- Self-conditioned CTC (SC-CTC) improves token span estimation.
Method
UMA-Split uses convolutional subsampling, high-rate (E-Branchformer) and low-rate (Transformer) encoders, a UMA module for dynamic aggregation, and a split module to generate two tokens per aggregated frame for CTC loss.
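The aggregation and split steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: segments are cut at local minima ("valleys") of the per-frame weights, each segment is reduced to a weighted average, and the split step is stood in for by two placeholder functions (`proj_a`, `proj_b`) where a real model would use learned layers. The valley rule, the choice to start a new segment at the valley frame, and all function names are hypothetical.

```python
def segment_bounds(weights):
    """Cut the frame axis at each local minimum (valley) of the weights."""
    starts = [0]
    for t in range(1, len(weights) - 1):
        # Assumed valley rule: strictly lower than the previous frame,
        # not higher than the next one.
        if weights[t] < weights[t - 1] and weights[t] <= weights[t + 1]:
            starts.append(t)
    ends = starts[1:] + [len(weights)]
    return list(zip(starts, ends))


def aggregate(frames, weights):
    """Weighted average of the frames inside each segment."""
    dim = len(frames[0])
    out = []
    for start, end in segment_bounds(weights):
        total = sum(weights[start:end])
        out.append([
            sum(weights[t] * frames[t][d] for t in range(start, end)) / total
            for d in range(dim)
        ])
    return out


def split(aggregated, proj_a, proj_b):
    """Emit two output frames per aggregated frame, doubling the CTC input length."""
    out = []
    for h in aggregated:
        out.append(proj_a(h))
        out.append(proj_b(h))
    return out


# Toy input: 8 frames of dimension 2 with two weight peaks (made-up values).
weights = [0.1, 0.6, 0.9, 0.4, 0.2, 0.7, 0.8, 0.3]
frames = [[float(t), 1.0] for t in range(len(weights))]

aggregated = aggregate(frames, weights)  # one frame per weight peak
split_out = split(
    aggregated,
    proj_a=lambda h: [2.0 * x for x in h],  # placeholder for a learned layer
    proj_b=lambda h: [x + 1.0 for x in h],  # placeholder for a learned layer
)
```

With these toy weights the eight input frames collapse into two aggregated frames, and the split step expands them to four output frames that would then feed the CTC loss.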
In practice
- Use a split module for fine-grained tokenization.
- Integrate SC-CTC for better token span recognition.
- Consider larger BPE vocabularies for English UMA.
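A quick way to see why vocabulary size matters is to estimate how many acoustic frames each token spans on average. The helper below is a back-of-the-envelope sketch. The 40 ms frame shift, the utterance length, and the token counts are made-up illustrative values, and `frames_per_token` is a hypothetical name. The threshold of three frames comes from the summary above.

```python
def frames_per_token(duration_s, frame_shift_ms, num_tokens):
    """Average number of encoder frames available per text token."""
    num_frames = duration_s * 1000.0 / frame_shift_ms
    return num_frames / num_tokens


# Hypothetical 10 s utterance with an assumed 40 ms frame shift (250 frames).
small_vocab = frames_per_token(10.0, 40.0, num_tokens=100)  # many short tokens
large_vocab = frames_per_token(10.0, 40.0, num_tokens=60)   # fewer, longer tokens

# Tokens spanning fewer than three frames are the problematic case.
small_vocab_too_short = small_vocab < 3.0
large_vocab_too_short = large_vocab < 3.0
```

A larger BPE vocabulary tends to encode the same transcript in fewer tokens, so each token covers more frames and has more room to form a unimodal weight pattern.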
Topics
- Non-Autoregressive ASR
- Unimodal Aggregation
- Connectionist Temporal Classification
- Speech Recognition
- Multilingual ASR
- BPE Tokenization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.