Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

2026-04-16 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

A systematic empirical study evaluated over 50 configurations of state-of-the-art Automatic Speech Recognition (ASR) architectures, spanning encoder-decoder, transducer, and LLM-based paradigms, for on-device streaming. It identified NVIDIA's Nemotron Speech Streaming as the most promising candidate for real-time English streaming on resource-constrained hardware. The researchers re-implemented the streaming inference pipeline in ONNX Runtime and applied post-training quantization strategies (int4 k-quant, mixed-precision, and round-to-nearest quantization) combined with graph-level operator fusion. These optimizations reduced the model size from 2.47 GB to 0.67 GB while keeping the Word Error Rate (WER) within 1 percentage point (absolute) of the full-precision PyTorch baseline. The recommended int4 k-quant configuration achieved an 8.20% average streaming WER across eight benchmarks and ran faster than real-time on CPU with 0.56 s algorithmic latency.
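To make the two headline metrics concrete, the sketch below computes the size reduction implied by the 2.47 GB and 0.67 GB figures quoted above, and shows one common way to compute WER (word-level edit distance divided by reference length). The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the study, and the study's own scoring pipeline (for example, its text normalization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (single-row Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Model sizes quoted in the summary.
full_precision_gb = 2.47
quantized_gb = 0.67

compression_ratio = full_precision_gb / quantized_gb
size_reduction_pct = 100.0 * (1.0 - quantized_gb / full_precision_gb)

print(f"compression ratio: {compression_ratio:.2f}x")  # compression ratio: 3.69x
print(f"size reduction: {size_reduction_pct:.1f}%")    # size reduction: 72.9%

# Hypothetical transcript pair: one deleted word and one substituted word
# against a five-word reference.
example_wer = word_error_rate("turn on the kitchen lights", "turn on kitchen light")
print(f"example WER: {example_wer:.2f}")  # example WER: 0.40
```

An absolute tolerance of 1 percentage point means, for instance, that a baseline at X% WER may degrade to at most (X + 1)% after quantization; it is not a 1% relative change.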

Key takeaway

For NLP engineers developing on-device streaming ASR solutions, this research shows that high accuracy at low latency is feasible on CPU. Consider NVIDIA Nemotron Speech Streaming as a strong baseline, and combine post-training quantization (specifically int4 k-quant) with ONNX Runtime inference to cut model size substantially while staying faster than real-time. The resulting configuration offers a new quality-efficiency Pareto point for on-device applications.

Key insights

On-device ASR can achieve high accuracy and low latency on CPU through systematic benchmarking and quantization.

Principles

Post-training quantization significantly reduces model size (here, from 2.47 GB to 0.67 GB).
Graph-level operator fusion improves inference efficiency.
Systematic benchmarking (here, over 50 configurations) identifies the most suitable architecture.
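As an illustration of how 4-bit quantization shrinks weights, here is a minimal sketch of symmetric, block-wise round-to-nearest quantization in plain Python. This is the simpler round-to-nearest variant, not the int4 k-quant configuration the study recommends, and the function names, the block size of 32, and the sample weights are assumptions chosen for the example.

```python
from typing import List, Tuple

QuantizedBlock = Tuple[float, List[int]]  # (scale, 4-bit integer codes)


def quantize_rtn_int4(weights: List[float], block_size: int = 32) -> List[QuantizedBlock]:
    """Symmetric per-block round-to-nearest quantization to the range [-8, 7]."""
    blocks: List[QuantizedBlock] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to code 7; an all-zero block
        # gets an arbitrary non-zero scale so the division below is safe.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # Python's round() rounds exact halves to the nearest even integer.
        codes = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize(blocks: List[QuantizedBlock]) -> List[float]:
    """Reconstruct approximate float weights from scales and integer codes."""
    return [scale * code for scale, codes in blocks for code in codes]


# Hypothetical weights, quantized in blocks of 4 for readability.
weights = [0.70, -0.35, 0.10, 0.00, 0.02, -0.01, 0.03, 0.015]
blocks = quantize_rtn_int4(weights, block_size=4)
restored = dequantize(blocks)

for original, approx in zip(weights, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

Each weight is stored as a 4-bit code plus a shared per-block scale, so the reconstruction error per weight is at most half the block's scale. Smaller blocks track local weight ranges more closely at the cost of storing more scales.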

Method

The method involves benchmarking ASR architectures, re-implementing the pipeline in ONNX Runtime, and applying post-training quantization (e.g., int4 k-quant) with graph-level operator fusion to optimize for size and speed.
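"Faster than real-time" is usually expressed as a real-time factor (RTF) below 1, meaning the compute time is shorter than the audio it covers. The sketch below measures RTF over a stream of audio chunks. The stub recognizer, the 0.5 s chunk length, and the 16 kHz sample rate are placeholders assumed for illustration; they are not the study's pipeline or settings.

```python
import time
from typing import Callable, Iterable, List


def real_time_factor(
    chunks: Iterable[List[float]],
    chunk_seconds: float,
    process_chunk: Callable[[List[float]], object],
) -> float:
    """Return total compute time divided by total audio duration."""
    audio_seconds = 0.0
    compute_seconds = 0.0
    for chunk in chunks:
        start = time.perf_counter()
        process_chunk(chunk)
        compute_seconds += time.perf_counter() - start
        audio_seconds += chunk_seconds
    if audio_seconds == 0.0:
        raise ValueError("no audio was processed")
    return compute_seconds / audio_seconds


def stub_recognizer(chunk: List[float]) -> float:
    """Stand-in for a streaming ASR step; a real model call would go here."""
    return sum(chunk)


# Hypothetical stream: ten 0.5 s chunks of silence at an assumed 16 kHz.
sample_rate_hz = 16_000
chunk_seconds = 0.5
chunks = [[0.0] * int(sample_rate_hz * chunk_seconds) for _ in range(10)]

rtf = real_time_factor(chunks, chunk_seconds, stub_recognizer)
print(f"real-time factor: {rtf:.6f} (below 1.0 means faster than real-time)")
```

RTF captures compute speed only. Algorithmic latency, such as the 0.56 s reported for the recommended configuration, is a separate quantity determined by how much audio the model must see before it can emit output.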

In practice

Use NVIDIA Nemotron Speech Streaming as the baseline for on-device English streaming ASR.
Apply int4 k-quant post-training quantization to reduce model size.
Run CPU inference through ONNX Runtime with graph-level optimizations enabled.
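For the ONNX Runtime step, a minimal sketch of opening a CPU-only session with graph optimizations enabled might look like the following. It assumes the `onnxruntime` Python package is installed and that a quantized model has already been exported to ONNX; the function name, the thread count, and the file name in the usage note are hypothetical, and the study's actual export and session settings are not described in this summary.

```python
def build_cpu_session(model_path: str, num_threads: int = 4):
    """Create a CPU-only ONNX Runtime session with graph optimizations enabled."""
    # Imported lazily so this module can be loaded without onnxruntime present.
    import onnxruntime as ort

    options = ort.SessionOptions()
    # ORT_ENABLE_ALL turns on ONNX Runtime's built-in graph optimizations,
    # which include node fusions; it is set explicitly here for clarity.
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    # Threads used to parallelize work inside a single operator.
    options.intra_op_num_threads = num_threads

    return ort.InferenceSession(
        model_path,
        sess_options=options,
        providers=["CPUExecutionProvider"],
    )
```

A call such as `build_cpu_session("asr_int4.onnx", num_threads=4)` (hypothetical file name) returns a session whose `run` method can then be invoked once per audio chunk. The best thread count depends on the target device and is worth measuring rather than assuming.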

Topics

On-Device ASR
Streaming ASR
Model Quantization
Low-Latency Inference
NVIDIA Nemotron Speech Streaming

Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.