Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization
Summary
This work presents a Unified ASR framework for Transducer (RNNT) training that narrows the accuracy gap between offline and streaming automatic speech recognition. A single model supports both decoding modes by combining chunk-limited attention with right context and dynamic chunked convolutions. To make the two modes agree more closely, the authors add mode-consistency regularization for RNNT (MCR-RNNT), implemented efficiently in Triton, which promotes agreement across training modes. Experiments show that the approach improves streaming accuracy at low latency without degrading offline performance, and that it scales to larger models and training datasets. The Unified ASR framework and its English model checkpoint are open-sourced.
Key takeaway
If you build ASR systems that must serve both offline and low-latency streaming use cases, this Unified ASR framework lets a single Transducer model cover both. Consider adding chunk-limited attention and mode-consistency regularization (MCR-RNNT) to your Transducer training to improve streaming accuracy without sacrificing offline performance. The open-sourced framework is a practical starting point for implementation.
Key insights
A unified RNNT framework and mode-consistency regularization improve ASR streaming accuracy while preserving offline performance.
Principles
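- One model that serves both offline and streaming decoding avoids maintaining two separate systems.
- Chunk-limited attention with right context enables low-latency streaming within the same model.
- Consistency regularization encourages the offline and streaming modes to agree.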
Method
The method uses chunk-limited attention with right context and dynamic chunked convolutions for unified ASR. It integrates MCR-RNNT via an efficient Triton implementation to encourage agreement across training modes.
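To make the idea of chunk-limited attention with right context concrete, here is a minimal sketch of the attention mask it implies. The function name and the parameters `chunk_size`, `left_chunks`, and `right_context` are illustrative assumptions, not names from the paper or its released code, and the exact masking rule the authors use may differ.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks, right_context):
    """Build a boolean mask where mask[i][j] is True if query frame i may
    attend to key frame j.

    Hypothetical parameters (assumptions for illustration):
      chunk_size    - number of frames per chunk
      left_chunks   - how many previous chunks remain visible
      right_context - extra look-ahead frames past the end of the chunk
    """
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        # Earliest visible frame: start of the oldest allowed left chunk.
        start = max(0, (chunk - left_chunks) * chunk_size)
        # Latest visible frame: end of the current chunk plus look-ahead.
        end = min(num_frames, (chunk + 1) * chunk_size + right_context)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def visible(mask_row):
    """Return the indices of the key frames a query frame can attend to."""
    return [j for j, allowed in enumerate(mask_row) if allowed]


if __name__ == "__main__":
    streaming = chunk_attention_mask(8, chunk_size=2, left_chunks=1, right_context=1)
    for i, row in enumerate(streaming):
        print(i, visible(row))
```

Under this sketch, setting `chunk_size` to at least the utterance length yields full attention, which is one plausible way a single encoder could switch between streaming and offline behavior.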
In practice
- Implement chunk-limited attention for streaming ASR.
- Apply mode-consistency regularization for unified models.
- Utilize Triton for efficient regularization.
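The consistency idea in the list above can be sketched as a generic penalty on disagreement between the output distributions of two modes. This summary does not give the actual MCR-RNNT formulation or its Triton kernel, so the symmetric KL divergence, the function names, and the `weight` value below are all assumptions chosen for illustration, written in plain Python rather than Triton.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mode_consistency_loss(offline_logits, streaming_logits):
    """Mean symmetric KL between offline-mode and streaming-mode token
    distributions, one pair of logit vectors per output position.
    Assumption: a stand-in for the paper's regularizer, not its definition."""
    total = 0.0
    for off, strm in zip(offline_logits, streaming_logits):
        p, q = softmax(off), softmax(strm)
        total += 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return total / len(offline_logits)


def total_loss(rnnt_offline, rnnt_streaming, consistency, weight=0.1):
    """Combine per-mode RNNT losses with the consistency term.
    The default weight is a hypothetical value, not taken from the paper."""
    return rnnt_offline + rnnt_streaming + weight * consistency


if __name__ == "__main__":
    offline = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    streaming = [[1.5, 0.7, -0.5], [0.1, 0.2, 0.3]]
    reg = mode_consistency_loss(offline, streaming)
    print(reg, total_loss(1.0, 2.0, reg))
```

The term is zero when both modes produce identical distributions and grows as they diverge, which is the behavior a mode-agreement regularizer needs. A production version would operate on the RNNT output lattice on GPU, which is where a fused kernel becomes relevant.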
Topics
- Unified ASR
- RNNT
- Consistency Regularization
- Streaming Accuracy
- Chunk-limited Attention
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.