Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Audio and Speech Processing, Computation and Language) · Depth: Expert, quick

Summary

The paper presents a Unified ASR framework for Transducer (RNNT) training that narrows the performance gap between offline and streaming automatic speech recognition. A single model supports both decoding modes by using chunk-limited attention with right context and dynamic chunked convolutions. To align the two modes further, the authors introduce mode-consistency regularization for RNNT (MCR-RNNT), implemented efficiently in Triton, which encourages agreement across training modes. Experiments show that the approach improves streaming accuracy at low latency while preserving offline performance, and that it scales to larger models and larger training datasets. The framework and an English model checkpoint are open-sourced.

Key takeaway

For AI engineers building ASR systems that need both offline and low-latency streaming decoding, this framework lets a single Transducer model serve both modes. Consider adding chunk-limited attention and mode-consistency regularization (MCR-RNNT) to your Transducer training to improve streaming accuracy without sacrificing offline performance. The open-sourced framework is a practical starting point for implementation.
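
To make the idea of "agreement across modes" concrete, here is a minimal sketch of a generic consistency penalty between the output distributions a model produces in offline mode and in streaming mode. The summary does not specify the exact form of MCR-RNNT, so the symmetric KL divergence, the function names, and the way the term would be weighted are all assumptions made for illustration, not the authors' formulation or their Triton kernel.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_div(log_p, log_q):
    """KL(p || q) for two distributions given as log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mode_consistency_penalty(offline_logits, streaming_logits):
    """Mean symmetric KL between per-position output distributions.

    Hypothetical helper: each argument is a list of logit vectors, one per
    position (for a Transducer, one per lattice node), produced by the same
    model run in offline mode and in streaming mode respectively.
    """
    if len(offline_logits) != len(streaming_logits) or not offline_logits:
        raise ValueError("both modes must provide the same non-zero number of positions")
    total = 0.0
    for off, stream in zip(offline_logits, streaming_logits):
        lp, lq = log_softmax(off), log_softmax(stream)
        total += 0.5 * (kl_div(lp, lq) + kl_div(lq, lp))
    return total / len(offline_logits)
```

Under these assumptions, a training step would add `weight * mode_consistency_penalty(...)` to the sum of the RNNT losses computed in each mode, where `weight` is a hypothetical hyperparameter. The penalty is zero when both modes produce identical distributions and grows as they diverge.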

Key insights

A unified RNNT framework and mode-consistency regularization improve ASR streaming accuracy while preserving offline performance.

Method

The method uses chunk-limited attention with right context and dynamic chunked convolutions for unified ASR. It integrates MCR-RNNT via an efficient Triton implementation to encourage agreement across training modes.
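
Chunk-limited attention with right context can be illustrated with a small mask builder. This is a sketch under stated assumptions rather than the framework's code: the function name, the parameter names (`chunk_size`, `right_context`, `left_chunks`), and the convention that right context is counted in frames past the end of the query's chunk are all hypothetical choices made for the example.

```python
def chunk_attention_mask(num_frames, chunk_size, right_context=0, left_chunks=None):
    """Build a boolean mask where mask[q][k] is True if frame q may attend to frame k.

    Frames are grouped into consecutive chunks of `chunk_size`. A query frame
    sees its own chunk, `right_context` lookahead frames after that chunk, and
    either all earlier frames (left_chunks=None) or only the previous
    `left_chunks` chunks.
    """
    if num_frames <= 0 or chunk_size <= 0 or right_context < 0:
        raise ValueError("num_frames and chunk_size must be positive, right_context non-negative")
    mask = []
    for q in range(num_frames):
        chunk_idx = q // chunk_size
        # Exclusive end of the visible span: end of this chunk plus lookahead.
        end = min(num_frames, (chunk_idx + 1) * chunk_size + right_context)
        # Inclusive start of the visible span: full history or a limited window.
        if left_chunks is None:
            start = 0
        else:
            start = max(0, (chunk_idx - left_chunks) * chunk_size)
        mask.append([start <= k < end for k in range(num_frames)])
    return mask


def offline_attention_mask(num_frames):
    """Full-context mask: one chunk spanning the whole utterance."""
    return chunk_attention_mask(num_frames, chunk_size=num_frames)
```

With this convention, offline decoding is the special case of a single chunk covering every frame, which is one way a single model can be trained and evaluated in both modes: only the mask changes, not the weights. Smaller `chunk_size` and `right_context` values correspond to lower latency.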

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.