MURMUR: An Efficient Inference System for Long-Form ASR

2026-05-31 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Audio and Speech Processing · Depth: Expert, quick

Summary

Murmur is an inference system for long-form Automatic Speech Recognition (ASR) that targets the usual trade-off between accuracy and latency. Existing chunk-based pipelines offer low latency but lose cross-chunk context, while long-context ASR models are more accurate but significantly slower. Murmur operates at two levels. At the inter-chunk level, chunk size is a tunable hyperparameter for balancing accuracy and latency. At the intra-chunk level, it exploits attention sparsity using a sliding window KV cache eviction policy for both output and speech tokens. On the AMI-IHM benchmark, Murmur matched single-pass accuracy while reducing latency by 4.2x, and token eviction yielded additional gains at a cost of less than 1% relative tcpWER degradation.

Key takeaway

For Machine Learning Engineers deploying long-form ASR systems, Murmur shows that the accuracy of long-context inference can be kept at much lower latency. Investigate its two-level inference approach, particularly the tunable chunk size and the sliding window KV cache eviction, when optimizing your deployments. The approach can improve real-time performance and reduce operational costs for demanding speech processing applications.

Key insights

Murmur efficiently balances long-form ASR accuracy and latency using a two-level inference system.

Principles

Intermediate chunk sizes balance ASR accuracy against latency.
Attention sparsity can be exploited for inference efficiency.
Sliding window KV cache eviction limits the cached context for both output and speech tokens.
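
As a rough illustration of the first principle, the sketch below splits a long recording into fixed-length chunks whose duration is a single tunable parameter. The function name `split_into_chunks`, its parameters, and the choice of non-overlapping chunks are assumptions made for this example; the summary does not say how Murmur draws chunk boundaries.

```python
def split_into_chunks(
    num_samples: int, sample_rate: int, chunk_seconds: float
) -> list[tuple[int, int]]:
    """Return half-open [start, end) sample ranges covering the recording.

    Hypothetical helper: non-overlapping, fixed-length chunks are an
    assumption of this sketch, not a documented Murmur behavior.
    """
    if num_samples < 0 or sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and duration must be > 0")
    # Chunk length in samples; at least one sample so the loop always advances.
    chunk_len = max(1, int(round(chunk_seconds * sample_rate)))
    return [
        (start, min(start + chunk_len, num_samples))  # last chunk may be shorter
        for start in range(0, num_samples, chunk_len)
    ]
```

Smaller values of `chunk_seconds` produce more, shorter chunks (lower latency per chunk, less shared context), while larger values approach single-pass processing of the whole recording.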

Method

Murmur employs a two-level approach: inter-chunk processing with tunable chunk sizes and intra-chunk attention sparsity via a sliding window KV cache eviction policy for output and speech tokens.
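
The intra-chunk idea can be sketched with a toy cache that keeps only the most recent entries of each token type. Everything here is an assumption for illustration: the class name `SlidingWindowKVCache`, the separate window sizes for speech and output tokens, and the "keep the newest N" rule. The summary only states that a sliding window eviction policy is applied to both token types.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SlidingWindowKVCache:
    """Toy KV cache with one sliding window per token type (hypothetical)."""

    speech_window: int  # max cached speech-token entries (assumed parameter)
    output_window: int  # max cached output-token entries (assumed parameter)
    speech: deque = field(init=False)
    output: deque = field(init=False)

    def __post_init__(self) -> None:
        if self.speech_window <= 0 or self.output_window <= 0:
            raise ValueError("window sizes must be positive")
        # A deque with maxlen drops its oldest item when a new one is
        # appended to a full window, which is the eviction step.
        self.speech = deque(maxlen=self.speech_window)
        self.output = deque(maxlen=self.output_window)

    def append(self, kind: str, position: int, key, value) -> None:
        """Cache one token's key/value under its sequence position."""
        if kind == "speech":
            self.speech.append((position, key, value))
        elif kind == "output":
            self.output.append((position, key, value))
        else:
            raise ValueError(f"unknown token kind: {kind!r}")

    def visible_positions(self) -> list[int]:
        """Positions a new query could still attend to after eviction."""
        return [p for p, _, _ in self.speech] + [p for p, _, _ in self.output]

    def __len__(self) -> int:
        return len(self.speech) + len(self.output)
```

With this layout the cache size is bounded by `speech_window + output_window` regardless of chunk length, which is where the memory and latency savings of an eviction policy would come from.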

In practice

Tune ASR chunk sizes for specific performance needs.
Implement KV cache eviction for long-context models.
Apply attention sparsity techniques to reduce latency.
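
One way to act on these points is to sweep chunk sizes on your own data and pick the most accurate setting that fits a latency budget. The sketch below assumes you already have measurements; `Measurement`, `relative_degradation`, and `pick_chunk_size` are hypothetical names, and nothing here reproduces Murmur's own tuning procedure.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Measurement:
    """One sweep point, measured on your own data (hypothetical record)."""

    chunk_seconds: float
    error_rate: float  # e.g. tcpWER expressed as a fraction
    latency_s: float   # end-to-end latency for the test set, in seconds


def relative_degradation(baseline: float, candidate: float) -> float:
    """Relative increase in error rate versus a baseline (0.01 means 1%)."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (candidate - baseline) / baseline


def pick_chunk_size(
    measurements: Sequence[Measurement], latency_budget_s: float
) -> Optional[Measurement]:
    """Return the lowest-error point within the latency budget, if any."""
    feasible = [m for m in measurements if m.latency_s <= latency_budget_s]
    if not feasible:
        return None
    # Ties on error rate are broken in favor of lower latency.
    return min(feasible, key=lambda m: (m.error_rate, m.latency_s))
```

`relative_degradation` can also be used to check whether enabling token eviction stays within an acceptable accuracy loss, by comparing the error rate with and without eviction at the chosen chunk size.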

Topics

Long-Form ASR
Inference Systems
KV Cache Eviction
Attention Sparsity
Speech Processing
Low Latency

Code references

uw-syfi/Murmur

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.