Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference
Summary
A systematic empirical study evaluated over 50 configurations of state-of-the-art Automatic Speech Recognition (ASR) architectures, including encoder-decoder, transducer, and LLM-based paradigms, for on-device streaming. The research identified NVIDIA's Nemotron Speech Streaming as the most promising candidate for real-time English streaming on resource-constrained hardware. Researchers re-implemented the streaming inference pipeline in ONNX Runtime and applied post-training quantization strategies, such as int4 k-quant, mixed-precision, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduced the model size from 2.47 GB to 0.67 GB while maintaining a Word Error Rate (WER) within 1% absolute of the full-precision PyTorch baseline. The recommended int4 k-quant configuration achieved an 8.20% average streaming WER across eight benchmarks, running faster than real-time on CPU with 0.56 s algorithmic latency.
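The accuracy figures above are Word Error Rates, and "within 1% absolute" refers to a difference in percentage points rather than a relative change. A minimal sketch of both ideas follows, assuming the common definition of WER as word-level edit distance divided by the number of reference words. The function names and the lowercase/whitespace normalization are illustrative assumptions, not the study's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    # Assumption: simple lowercase + whitespace tokenization; real benchmarks
    # usually apply a more careful text normalizer first.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def within_absolute_margin(baseline_wer: float, candidate_wer: float,
                           margin: float = 0.01) -> bool:
    """True if the candidate is at most `margin` (absolute) worse than the baseline."""
    return candidate_wer - baseline_wer <= margin
```

With WER expressed as a fraction, a margin of 0.01 corresponds to one percentage point, so a quantized model passes the check only if its WER exceeds the full-precision baseline by no more than that amount.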
Key takeaway
For NLP engineers developing on-device streaming ASR solutions, this research demonstrates that high accuracy with low latency on CPU is feasible. Consider NVIDIA Nemotron Speech Streaming as a strong baseline, and combine post-training quantization (specifically int4 k-quant) with ONNX Runtime to cut model size substantially while keeping inference faster than real time. This combination can serve as a new quality-efficiency Pareto point for your applications.
Key insights
On-device ASR can achieve high accuracy and low latency on CPU through systematic benchmarking and quantization.
Principles
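- Post-training quantization significantly reduces model size: here from 2.47 GB to 0.67 GB, roughly 3.7x smaller.
- Graph-level operator fusion improves inference efficiency.
- Systematic benchmarking (over 50 configurations in this study) identifies the most promising architecture.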
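To make the first principle concrete, here is a minimal sketch of symmetric round-to-nearest int4 quantization with one scale per group of weights. It illustrates only the round-to-nearest idea named in the summary; it is not the int4 k-quant scheme the study recommends, and the group size of 32 and the function names are assumptions chosen for illustration.

```python
def quantize_rtn_int4(weights, group_size=32):
    """Symmetric round-to-nearest int4 quantization, one scale per group.

    Returns a list of (scale, codes) pairs, where each code is an integer
    in the signed 4-bit range [-8, 7].
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7 so codes fall in [-7, 7];
        # an all-zero group gets a dummy scale of 1.0.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # Python's round() resolves exact ties to the nearest even integer.
        codes = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize(groups):
    """Reconstruct approximate float weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]
```

Each reconstructed weight differs from the original by at most half a scale step, which is why smaller groups (and therefore tighter scales) trade extra storage for lower quantization error. Under the assumption of a 16-bit scale per group of 32 weights, storage works out to about 4.5 bits per weight.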
Method
The method involves benchmarking ASR architectures, re-implementing the pipeline in ONNX Runtime, and applying post-training quantization (e.g., int4 k-quant) with graph-level operator fusion to optimize for size and speed.
In practice
- Use NVIDIA Nemotron Speech Streaming as the starting point for on-device English streaming ASR.
- Apply int4 k-quant post-training quantization to reduce model size.
- Run inference with ONNX Runtime to optimize for CPU.
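When validating a CPU deployment along these lines, "faster than real time" is usually checked with a real-time factor: compute time divided by the duration of the audio processed. The sketch below assumes a hypothetical `process_chunk` callable that wraps whatever streaming inference step you use; the injectable `clock` parameter is an illustrative choice that makes the measurement easy to test.

```python
import time


def real_time_factor(process_chunk, chunks, chunk_seconds,
                     clock=time.perf_counter):
    """Total compute time divided by total audio duration.

    A value below 1.0 means the pipeline keeps up with live audio.
    `process_chunk` is a hypothetical callable standing in for one
    streaming inference step on a single audio chunk.
    """
    if not chunks:
        raise ValueError("need at least one audio chunk")
    start = clock()
    for chunk in chunks:
        process_chunk(chunk)
    elapsed = clock() - start
    return elapsed / (len(chunks) * chunk_seconds)
```

Real-time factor measures throughput only. It is separate from algorithmic latency (0.56 s in the summarized study), which depends on how much audio the model must see before it can emit output.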
Topics
- On-Device ASR
- Streaming ASR
- Model Quantization
- Low-Latency Inference
- NVIDIA Nemotron Speech Streaming
Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.