MURMUR: An Efficient Inference System for Long-Form ASR
Summary
Murmur is an inference system for long-form Automatic Speech Recognition (ASR) that targets the usual trade-off between accuracy and latency. Existing chunk-based pipelines offer low latency but lose cross-chunk context, while long-context ASR models are more accurate but significantly slower. Murmur operates at two levels. At the inter-chunk level, chunk size is a tunable hyperparameter that balances accuracy against latency. At the intra-chunk level, it exploits attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On the AMI-IHM benchmark, Murmur matched single-pass accuracy while reducing latency by 4.2x; token eviction brought additional gains at a cost of less than 1% relative tcpWER degradation.
Key takeaway
For Machine Learning Engineers deploying long-form ASR systems, Murmur presents a validated architecture to achieve high accuracy without compromising on latency. You should investigate its two-level inference approach, particularly the tunable chunk sizes and the sliding window KV cache eviction, to optimize your model deployments. This method can significantly improve real-time performance and reduce operational costs for demanding speech processing applications.
Key insights
Murmur efficiently balances long-form ASR accuracy and latency using a two-level inference system.
Principles
- Intermediate chunk sizes balance ASR accuracy against latency.
- Attention sparsity can be exploited for inference efficiency.
- Sliding window KV cache eviction limits the attention context kept for both output and speech tokens.
Method
Murmur employs a two-level approach: inter-chunk processing with tunable chunk sizes and intra-chunk attention sparsity via a sliding window KV cache eviction policy for output and speech tokens.
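The sliding window eviction idea can be illustrated with a minimal sketch. This is not Murmur's implementation: the class name `SlidingWindowKVCache`, the `window` parameter, and the use of plain Python lists for keys and values are all assumptions made for illustration. It shows only the general policy of keeping the most recent entries and evicting the oldest.

```python
from collections import deque
from typing import Deque, List, Tuple

# Hypothetical illustration of a sliding window eviction policy.
# Names and data layout are assumptions, not taken from Murmur.
Entry = Tuple[int, List[float], List[float]]  # (position, key, value)


class SlidingWindowKVCache:
    """Keeps only the `window` most recent (key, value) entries."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        # deque(maxlen=...) drops the oldest entry automatically on overflow.
        self._entries: Deque[Entry] = deque(maxlen=window)

    def append(self, position: int, key: List[float], value: List[float]) -> None:
        """Add the entry for a new token, evicting the oldest if the cache is full."""
        self._entries.append((position, key, value))

    def positions(self) -> List[int]:
        """Token positions still available to attention, oldest first."""
        return [pos for pos, _, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


# Usage: with a window of 3, only the last three tokens remain cached.
cache = SlidingWindowKVCache(window=3)
for pos in range(5):
    cache.append(pos, key=[float(pos)], value=[float(pos)])
print(cache.positions())
```

Because the cache size is bounded by `window`, the attention cost of each decoding step stays constant instead of growing with the number of tokens seen. A system that evicts both output and speech tokens could, under the same assumptions, hold one such cache per token type with separate window sizes.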
In practice
- Tune ASR chunk sizes for specific performance needs.
- Implement KV cache eviction for long-context models.
- Apply attention sparsity techniques to reduce latency.
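As a rough sketch of the first item, chunk size can be treated as a hyperparameter that is swept and then selected under a latency budget. The helper names `split_into_chunks` and `pick_chunk_size`, and the shape of the `measurements` mapping, are hypothetical; the error rates and latencies would have to come from your own evaluation runs.

```python
from typing import Dict, List, Tuple


def split_into_chunks(
    num_samples: int, sample_rate: int, chunk_seconds: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges covering the audio in fixed-size chunks.

    The last chunk may be shorter than the others.
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("chunk_seconds is too small for this sample rate")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


def pick_chunk_size(
    measurements: Dict[float, Tuple[float, float]], max_latency_s: float
) -> float:
    """Pick the chunk size with the lowest error rate within the latency budget.

    `measurements` maps chunk_seconds -> (error_rate, latency_s), as measured
    on your own evaluation set (hypothetical structure, not from the source).
    """
    feasible = {
        chunk: metrics
        for chunk, metrics in measurements.items()
        if metrics[1] <= max_latency_s
    }
    if not feasible:
        raise ValueError("no chunk size meets the latency budget")
    return min(feasible, key=lambda chunk: feasible[chunk][0])
```

Selecting by lowest error under a fixed latency budget is only one possible criterion; a deployment with a hard accuracy floor could invert it and minimize latency instead.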
Topics
- Long-Form ASR
- Inference Systems
- KV Cache Eviction
- Attention Sparsity
- Speech Processing
- Low Latency
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.