MURMUR: An Efficient Inference System for Long-Form ASR

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Audio and Speech Processing · Depth: Expert, quick

Summary

Murmur is an inference system for long-form Automatic Speech Recognition (ASR) that addresses the usual trade-off between accuracy and latency. Existing chunk-based pipelines offer low latency but lose cross-chunk context, while long-context ASR models are more accurate but significantly slower. Murmur operates at two levels: an inter-chunk level, where chunk size is a tunable hyperparameter for balancing accuracy and latency, and an intra-chunk level, which exploits attention sparsity through a sliding-window KV cache eviction policy applied to both output and speech tokens. On the AMI-IHM benchmark, Murmur matched single-pass accuracy while reducing latency by 4.2x; token eviction yielded further gains at a cost of less than 1% relative tcpWER degradation.
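To make the inter-chunk knob concrete, here is a minimal sketch of splitting a long recording into fixed-size chunks. The function name `split_into_chunks` and the parameter `chunk_seconds` are hypothetical and do not come from the original article, and the sketch ignores any overlap or context carry-over that a real system might use.

```python
from typing import List, Tuple


def split_into_chunks(total_seconds: float, chunk_seconds: float) -> List[Tuple[float, float]]:
    """Return (start, end) spans that cover the recording in fixed-size chunks.

    chunk_seconds is the tunable hyperparameter (an assumed name): larger
    chunks keep more context per pass, smaller chunks mean shorter passes.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        # The last chunk is truncated at the end of the recording.
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    # A 100-second recording split into 30-second chunks.
    for span in split_into_chunks(100.0, 30.0):
        print(span)
```

Setting `chunk_seconds` equal to the full recording length reduces this to a single pass, while smaller values move toward a conventional chunk-based pipeline.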

Key takeaway

For Machine Learning Engineers deploying long-form ASR systems, Murmur offers an architecture, validated on AMI-IHM, that reaches high accuracy without sacrificing latency. Investigate its two-level inference approach, particularly the tunable chunk size and the sliding-window KV cache eviction, when optimizing your model deployments. Lower inference latency can improve real-time responsiveness and reduce serving costs for demanding speech processing applications.

Key insights

Murmur efficiently balances long-form ASR accuracy and latency using a two-level inference system.

Method

Murmur employs a two-level approach: inter-chunk processing with tunable chunk sizes and intra-chunk attention sparsity via a sliding window KV cache eviction policy for output and speech tokens.
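The intra-chunk idea can be illustrated with a generic sliding-window KV cache: once the cache holds more than a fixed number of tokens, the oldest entries are evicted. This is a sketch under stated assumptions, not Murmur's implementation; the class `SlidingWindowKVCache`, the `window` parameter, and the use of plain Python lists for keys and values are all hypothetical. A system following the description above might keep one such cache for output tokens and another for speech tokens.

```python
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]


class SlidingWindowKVCache:
    """Keep key/value pairs for only the most recent `window` tokens."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        # Each entry is (token position, key vector, value vector).
        self._entries: Deque[Tuple[int, Vector, Vector]] = deque()
        self._next_pos = 0

    def append(self, key: Vector, value: Vector) -> None:
        """Add one token's key/value pair, evicting the oldest if over budget."""
        self._entries.append((self._next_pos, key, value))
        self._next_pos += 1
        while len(self._entries) > self.window:
            self._entries.popleft()  # evict the oldest cached token

    def positions(self) -> List[int]:
        """Positions of the tokens that attention can still see."""
        return [pos for pos, _, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


if __name__ == "__main__":
    cache = SlidingWindowKVCache(window=3)
    for t in range(5):
        cache.append([float(t)], [float(t)])
    print(cache.positions())
```

Because the cache size is bounded by `window`, the per-token attention cost stays constant instead of growing with the length of the chunk, which is where the latency saving in this kind of policy comes from.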

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.