Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, medium

Summary

A systematic empirical study evaluated over 50 configurations of state-of-the-art Automatic Speech Recognition (ASR) architectures, including encoder-decoder, transducer, and LLM-based paradigms, for on-device streaming. The research identified NVIDIA's Nemotron Speech Streaming as the most promising candidate for real-time English streaming on resource-constrained hardware. Researchers re-implemented the streaming inference pipeline in ONNX Runtime and applied post-training quantization strategies, such as int4 k-quant, mixed-precision, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduced the model size from 2.47 GB to 0.67 GB while maintaining a Word Error Rate (WER) within 1% absolute of the full-precision PyTorch baseline. The recommended int4 k-quant configuration achieved an 8.20% average streaming WER across eight benchmarks, running faster than real-time on CPU with 0.56 s algorithmic latency.
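Word Error Rate is the metric behind the figures above, so a small reference implementation may help when reproducing comparisons. The sketch below is a minimal illustration, assuming whitespace tokenization and no text normalization; it is not the study's scoring pipeline, and `word_error_rate` is a name chosen here for illustration. Note that "within 1% absolute" refers to a difference in percentage points between two WER values, not a relative ratio.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared exactly;
    real ASR scoring usually normalizes case and punctuation first.
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6 words.
    print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.3f}")
```

Averaging this value over several benchmarks, as the study does across eight, simply means computing it per dataset and taking the mean.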

Key takeaway

For NLP engineers developing on-device streaming ASR solutions, this research shows that high accuracy and low latency are achievable on CPU. Consider NVIDIA's Nemotron Speech Streaming as a strong baseline, run it through ONNX Runtime, and apply post-training quantization (int4 k-quant is the recommended configuration) to shrink the model substantially while keeping inference faster than real time. The resulting configuration is a useful quality-efficiency reference point when comparing other on-device models.
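To make the size and speed claims concrete, the sketch below works through the arithmetic. The 2.47 GB and 0.67 GB figures come from the summary above; the helper names and the timing values in the usage example are hypothetical and only illustrate how a real-time factor (processing time divided by audio duration) is usually read.

```python
from typing import Tuple


def compression_stats(original_gb: float, compressed_gb: float) -> Tuple[float, float]:
    """Return (compression ratio, fractional size reduction)."""
    ratio = original_gb / compressed_gb
    reduction = 1.0 - compressed_gb / original_gb
    return ratio, reduction


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(rtf: float) -> bool:
    return rtf < 1.0


if __name__ == "__main__":
    # Model sizes reported in the summary: 2.47 GB -> 0.67 GB.
    ratio, reduction = compression_stats(2.47, 0.67)
    print(f"{ratio:.2f}x smaller, {reduction:.1%} reduction")

    # Hypothetical measurement: 2.0 s of CPU time for 10.0 s of audio.
    rtf = real_time_factor(2.0, 10.0)
    print(f"RTF = {rtf:.2f}, faster than real time: {is_faster_than_real_time(rtf)}")
```

Real-time factor measures throughput only. The 0.56 s algorithmic latency quoted in the summary is a separate property that depends on how much audio the model must buffer before emitting output.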

Key insights

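On-device streaming ASR can reach high accuracy and low latency on CPU when a model is chosen through systematic benchmarking and then quantized: the recommended int4 k-quant configuration is 0.67 GB, averages 8.20% streaming WER across eight benchmarks, and runs faster than real time with 0.56 s algorithmic latency.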

Principles

Method

The study proceeds in three steps: benchmark candidate ASR architectures under streaming conditions, re-implement the selected model's streaming inference pipeline in ONNX Runtime, and apply post-training quantization (e.g., int4 k-quant) together with graph-level operator fusion to reduce model size and inference time.
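Round-to-nearest is the simplest of the quantization strategies named in the summary, so it makes a reasonable illustration of what post-training weight quantization does. The sketch below is a minimal, pure-Python version under stated assumptions: symmetric signed int4 codes in [-8, 7], one scale per block of 32 weights, and hypothetical function names. It is not the k-quant scheme the study recommends and not the authors' implementation.

```python
from typing import List, Tuple


def quantize_rtn_int4(weights: List[float], block_size: int = 32) -> Tuple[List[int], List[float]]:
    """Symmetric block-wise round-to-nearest quantization to signed int4.

    Assumptions (not from the source): block_size of 32 and a per-block
    scale that maps the largest magnitude in the block to 7.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in block:
            # Python's round() resolves exact ties to the even integer.
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))  # guard to stay in int4 range
    return codes, scales


def dequantize(codes: List[int], scales: List[float], block_size: int = 32) -> List[float]:
    """Reconstruct approximate weights from int4 codes and per-block scales."""
    return [code * scales[i // block_size] for i, code in enumerate(codes)]


if __name__ == "__main__":
    weights = [0.1 * i - 1.0 for i in range(64)]  # toy weights, two blocks
    codes, scales = quantize_rtn_int4(weights)
    restored = dequantize(codes, scales)
    max_error = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"blocks: {len(scales)}, max reconstruction error: {max_error:.4f}")
```

Each weight is stored in 4 bits plus a shared per-block scale, which is where most of the size reduction comes from. Smaller blocks lower the reconstruction error at the cost of storing more scales.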

Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.