UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Recognition · Depth: Expert, long

Summary

UMA-Split is a non-autoregressive (NAR) speech recognition model designed for both English and Mandarin. The original Unimodal Aggregation (UMA) method works well for Mandarin but struggles with English, because fine-grained BPE tokens often span fewer than three acoustic frames, which hinders the formation of unimodal weights. UMA-Split adds a "split module" that lets each UMA-aggregated frame map to two text tokens before the Connectionist Temporal Classification (CTC) loss is computed. On LibriSpeech (English), the 149M-parameter model reaches 2.22%/4.93% Word Error Rate (WER) on test-clean/test-other, matching hybrid CTC/attention autoregressive models while offering a 10x inference speedup. On AISHELL-1 (Mandarin), it reaches a 4.43% Character Error Rate (CER), outperforming other advanced NAR models.

Key takeaway

For machine learning engineers building non-autoregressive ASR systems for English and Mandarin, UMA-Split's split module addresses the main limitation of unimodal aggregation under fine-grained tokenization, which matters most for English. The reported LibriSpeech results match hybrid CTC/attention autoregressive accuracy at a 10x inference speedup, which makes the approach a candidate for real-time speech processing where both speed and accuracy are critical.

Key insights

UMA-Split extends unimodal aggregation to English non-autoregressive ASR by allowing aggregated frames to map to multiple fine-grained tokens.
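
As a rough illustration of why letting one aggregated frame carry two tokens helps, the sketch below interleaves two per-frame token predictions into a sequence twice as long and then applies the standard CTC collapse (merge repeats, drop blanks). It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `interleave_split` and `ctc_collapse`, the blank id `0`, and the use of already-decoded token ids instead of logits are all hypothetical choices made for readability.

```python
BLANK = 0  # assumed CTC blank id (illustrative, not from the source)


def interleave_split(slot_a, slot_b):
    """Interleave two per-frame predictions into one sequence of length 2N.

    slot_a[i] and slot_b[i] are the two token ids assumed to be emitted
    for aggregated frame i; either may be the blank id.
    """
    if len(slot_a) != len(slot_b):
        raise ValueError("both slots must cover the same aggregated frames")
    out = []
    for a, b in zip(slot_a, slot_b):
        out.extend((a, b))
    return out


def ctc_collapse(ids, blank=BLANK):
    """Standard CTC post-processing: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out


# Three aggregated frames; only the second one carries two short tokens.
slot_a = [5, 7, 9]
slot_b = [0, 8, 0]
expanded = interleave_split(slot_a, slot_b)  # [5, 0, 7, 8, 9, 0]
tokens = ctc_collapse(expanded)              # [5, 7, 8, 9]
print(tokens)
```

Because the second slot can be blank, a frame that covers a single long token still decodes to one token, while a frame that covers two short BPE tokens can emit both.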

Principles

NAR ASR can match AR performance.
Explicit frame aggregation improves token representation.
Self-conditioned CTC (SC-CTC) enhances span estimation.

Method

UMA-Split uses convolutional subsampling, a high-rate encoder (E-Branchformer) and a low-rate encoder (Transformer), a UMA module for dynamic aggregation, and a split module that generates two tokens per aggregated frame before the CTC loss is computed.
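
The dynamic aggregation step can be sketched as follows. This is a simplified illustration, assuming that each frame already has a scalar weight in (0, 1), that segment boundaries fall on local minima ("valleys") of the weight sequence, and that each segment is reduced to a weighted average of its frames. The valley rule, the choice to assign a valley frame to the segment it opens, and the names `find_valleys` and `unimodal_aggregate` are assumptions for this example; the summary above does not specify these details.

```python
def find_valleys(weights):
    """Return interior indices t where weights[t] is a local minimum.

    Assumed boundary rule: w[t] <= w[t-1] and w[t] <= w[t+1].
    """
    valleys = []
    for t in range(1, len(weights) - 1):
        if weights[t] <= weights[t - 1] and weights[t] <= weights[t + 1]:
            valleys.append(t)
    return valleys


def unimodal_aggregate(frames, weights):
    """Weighted-average the frames inside each valley-delimited segment.

    frames:  list of T feature vectors (lists of floats, equal length)
    weights: list of T positive scalars, one per frame
    Returns one aggregated vector per segment.
    """
    if len(frames) != len(weights):
        raise ValueError("frames and weights must have the same length")
    total_frames = len(frames)
    # Each valley frame opens a new segment (a simplification).
    bounds = [0] + find_valleys(weights) + [total_frames]
    aggregated = []
    for start, end in zip(bounds, bounds[1:]):
        if start == end:
            continue  # skip empty segments
        weight_sum = sum(weights[start:end])
        dim = len(frames[start])
        aggregated.append([
            sum(weights[t] * frames[t][d] for t in range(start, end)) / weight_sum
            for d in range(dim)
        ])
    return aggregated


# Seven frames whose weights rise and fall twice, so two segments are expected.
weights = [0.1, 0.6, 0.9, 0.3, 0.2, 0.7, 0.4]
frames = [[1.0], [1.0], [1.0], [1.0], [3.0], [3.0], [3.0]]
segments = unimodal_aggregate(frames, weights)
print(len(segments))
```

With these toy inputs the only interior valley is at index 4, so the seven frames are reduced to two aggregated frames. When a token spans only one or two frames, there is little room for a rise-and-fall pattern to appear between valleys, which is the difficulty the split module is meant to sidestep.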

In practice

Use a split module for fine-grained tokenization.
Integrate SC-CTC for better token span recognition.
Consider larger BPE vocabularies for English UMA.
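
The vocabulary-size advice follows from simple arithmetic on how many encoder frames each token covers. The sketch below is illustrative only: the 10 ms frame shift, the 4x subsampling factor, the utterance length, and the token counts are assumed values chosen for the example, not figures from the paper, and `frames_per_token` is a hypothetical helper.

```python
def frames_per_token(duration_s, num_tokens, frame_shift_ms=10.0, subsampling=4):
    """Average number of subsampled encoder frames available per text token."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    encoder_frames = duration_s * 1000.0 / frame_shift_ms / subsampling
    return encoder_frames / num_tokens


# Hypothetical 10-second utterance tokenized two ways.
fine_grained = frames_per_token(10.0, num_tokens=100)  # small BPE vocabulary
coarser = frames_per_token(10.0, num_tokens=60)        # larger BPE vocabulary
print(fine_grained, coarser)
```

Under these assumed settings the fine-grained tokenization leaves about 2.5 frames per token, below the three-frame span mentioned in the summary, while the coarser tokenization leaves a little over four.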

Topics

Non-Autoregressive ASR
Unimodal Aggregation
Connectionist Temporal Classification
Speech Recognition
Multilingual ASR
BPE Tokenization

Code references

Audio-WestlakeU/UMA-ASR

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.