Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Audio and Speech Processing, Computation and Language · Depth: Expert, quick

Summary

The Unified ASR framework for Transducer (RNNT) training aims to reduce the accuracy gap between offline and streaming automatic speech recognition. A single model supports both decoding modes by combining chunk-limited attention with right context and dynamic chunked convolutions. To make the two modes behave more consistently, the researchers add mode-consistency regularization for RNNT (MCR-RNNT), implemented efficiently in Triton, which encourages agreement across training modes. The reported experiments show that this approach improves streaming accuracy at low latency without degrading offline performance, and that it scales to larger models and larger training datasets. The Unified ASR framework and its English model checkpoint are open-sourced.

Key takeaway

For AI engineers building ASR systems that must serve both offline and low-latency streaming use cases, the Unified ASR framework lets one Transducer model cover both decoding modes. Consider adding chunk-limited attention and mode-consistency regularization (MCR-RNNT) to your Transducer training to improve streaming accuracy without sacrificing offline performance. The open-sourced framework and English checkpoint provide a practical starting point.

Key insights

A unified RNNT framework and mode-consistency regularization improve ASR streaming accuracy while preserving offline performance.

Principles

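A single unified model serves both offline and streaming decoding, so separate models per mode are not needed.
Chunk-limited attention with right context lets the same encoder run in streaming mode at low latency.
Consistency regularization encourages the offline and streaming modes to agree.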

Method

The method uses chunk-limited attention with right context and dynamic chunked convolutions for unified ASR. It integrates MCR-RNNT via an efficient Triton implementation to encourage agreement across training modes.
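
To make chunk-limited attention with right context concrete, here is a minimal NumPy sketch of the attention mask such an encoder might use. It is an illustration under stated assumptions, not the framework's actual code: the function name chunk_attention_mask and the parameters chunk_size, left_chunks, and right_context are hypothetical. Each frame may attend to every frame in its own chunk, to a configurable number of past chunks, and to a few look-ahead frames past the chunk boundary.

```python
import numpy as np

def chunk_attention_mask(num_frames, chunk_size, left_chunks=-1, right_context=0):
    """Boolean mask of shape [T, T]; mask[i, j] is True if query frame i
    may attend to key frame j. All parameter names are illustrative."""
    idx = np.arange(num_frames)
    chunk_id = idx // chunk_size

    # Last visible key: end of the query's own chunk plus look-ahead frames,
    # clipped to the end of the utterance.
    chunk_end = (chunk_id + 1) * chunk_size - 1
    max_key = np.minimum(chunk_end + right_context, num_frames - 1)

    # First visible key: unlimited history if left_chunks < 0,
    # otherwise the start of the chunk `left_chunks` chunks back.
    if left_chunks < 0:
        min_key = np.zeros(num_frames, dtype=int)
    else:
        min_key = np.maximum((chunk_id - left_chunks) * chunk_size, 0)

    keys = idx[None, :]
    return (keys >= min_key[:, None]) & (keys <= max_key[:, None])

# Streaming-style mask: 6 frames, chunks of 2, one look-ahead frame.
streaming_mask = chunk_attention_mask(6, chunk_size=2, right_context=1)

# Offline-style mask: one chunk spanning the whole utterance (full attention).
offline_mask = chunk_attention_mask(6, chunk_size=6)
```

In this sketch, frame 0 sees frames 0 through 2 in the streaming mask (its own chunk plus one frame of right context), while the offline mask is all True. Switching between the two masks on the same weights is one plausible way a single model can expose both decoding modes.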

In practice

Implement chunk-limited attention with right context for streaming ASR.
Apply mode-consistency regularization when training unified models.
Use a Triton implementation to keep the regularization efficient.
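
The summary does not give the exact form of MCR-RNNT, so the following is only a sketch of what a mode-consistency term could look like. It assumes the regularizer is a symmetric KL divergence between the offline-mode and streaming-mode output distributions over the RNNT joint lattice. That choice, the function names, and the tensor layout [batch, T, U, vocab] are all assumptions made for illustration, and the code uses plain NumPy rather than Triton.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def mode_consistency_loss(offline_logits, streaming_logits):
    """Symmetric KL between two modes' joint-network outputs.

    Both inputs are assumed to have shape [batch, T, U, vocab] and to come
    from the same utterance encoded in offline and streaming mode.
    This is a hypothetical stand-in, not the published MCR-RNNT loss.
    """
    lp_off = log_softmax(offline_logits)
    lp_str = log_softmax(streaming_logits)

    # KL(offline || streaming) and KL(streaming || offline) per lattice node.
    kl_off_str = (np.exp(lp_off) * (lp_off - lp_str)).sum(axis=-1)
    kl_str_off = (np.exp(lp_str) * (lp_str - lp_off)).sum(axis=-1)

    # Average over batch and all (t, u) lattice positions.
    return 0.5 * (kl_off_str + kl_str_off).mean()

# Toy example: random logits standing in for the two modes' joint outputs.
rng = np.random.default_rng(0)
offline = rng.normal(size=(2, 5, 3, 8))
streaming = rng.normal(size=(2, 5, 3, 8))
consistency = mode_consistency_loss(offline, streaming)
```

Under these assumptions, the term would be added with some weight to the RNNT losses computed in each mode. Note that this sketch materializes full distributions over a four-dimensional tensor, which gets expensive for long utterances and large vocabularies; that cost is presumably what an efficient fused Triton kernel is meant to address.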

Topics

Unified ASR
RNNT
Consistency Regularization
Streaming Accuracy
Chunk-limited Attention

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.