Streaming Speech-to-Text Translation with a SpeechLLM
Summary
This paper introduces a novel LLM-based architecture, the "intermixed" SpeechLLM, for true streaming speech-to-text translation. Existing SpeechLLM systems either follow a fixed wait policy or wait for the complete utterance; this model instead learns to decide adaptively when it has heard enough audio to emit the next output tokens. It is trained on automatic phrase-level alignments of input speech and target text, a method proposed to overcome the challenges of word-level alignment between disparate languages such as English and Korean. Experiments on English-to-French and English-to-Korean show that the intermixed SpeechLLM achieves translation quality comparable to non-streaming baselines at a much lower latency of only 1-2 seconds. The paper also proposes an "early-exit wait policy" variant that reduces on-device energy consumption by minimizing LLM calls without compromising translation quality. Finally, it shows that the model is robust to acoustic variations such as prepended silence, which causes catastrophic failures in fixed-policy systems.
Key takeaway
Research scientists developing real-time speech-to-text translation systems should consider adopting the intermixed SpeechLLM architecture. Compared to fixed-policy systems, it offers a better latency-quality trade-off (1-2 seconds of latency with COMET scores comparable to non-streaming baselines) and is more robust to real-world acoustic variations. They can also integrate the early-exit wait policy to reduce energy consumption in on-device deployment without sacrificing translation quality.
Key insights
A novel SpeechLLM adaptively decides when to emit translation tokens, achieving low latency and high quality.
Principles
- A learned, adaptive wait policy is more robust than a fixed wait policy, which can fail catastrophically on inputs such as prepended silence.
- Phrase-level alignments improve translation quality for disparate language pairs such as English and Korean, where word-level alignment is problematic.
Method
The "intermixed" SpeechLLM integrates a learned wait policy directly into the LLM: the model outputs wait tokens to request more audio before it continues translating. An "early-exit wait policy" further reduces energy consumption by using a faster, simpler head to decide whether the full LLM needs to be evaluated at all.
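The decoding loop described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the token name `<wait>`, the functions `stream_translate`, `toy_full_model`, and `toy_cheap_head`, and the hard-coded `TARGET` and `REQUIRED` tables are all hypothetical stand-ins for a real model and a real early-exit head.

```python
from typing import Callable, List, Sequence, Tuple

WAIT = "<wait>"  # hypothetical name for the wait token

Model = Callable[[List[str], List[str]], str]
Head = Callable[[List[str], List[str]], bool]


def stream_translate(
    audio_chunks: Sequence[str],
    full_model: Model,
    cheap_head: Head,
    max_tokens_per_chunk: int = 16,
) -> Tuple[List[str], int]:
    """Consume audio chunk by chunk; return (emitted tokens, full-model calls)."""
    heard: List[str] = []
    emitted: List[str] = []
    llm_calls = 0
    for chunk in audio_chunks:
        heard.append(chunk)
        # Early exit: if the cheap head predicts "wait", skip the full model.
        if cheap_head(heard, emitted):
            continue
        # Otherwise let the full model emit tokens until it asks for more audio.
        for _ in range(max_tokens_per_chunk):
            token = full_model(heard, emitted)
            llm_calls += 1
            if token == WAIT:
                break
            emitted.append(token)
    return emitted, llm_calls


# Toy stand-ins (assumptions): target token i becomes available once
# REQUIRED[i] audio chunks have been heard.
TARGET = ["bonjour", "le", "monde"]
REQUIRED = [2, 4, 4]


def toy_full_model(heard: List[str], emitted: List[str]) -> str:
    i = len(emitted)
    if i < len(TARGET) and REQUIRED[i] <= len(heard):
        return TARGET[i]
    return WAIT


def toy_cheap_head(heard: List[str], emitted: List[str]) -> bool:
    # Toy heuristic: only wake the full model on every second chunk.
    return len(heard) % 2 == 1


def never_wait(heard: List[str], emitted: List[str]) -> bool:
    # Disables the early exit: the full model runs on every chunk.
    return False


chunks = ["a0", "a1", "a2", "a3"]
emitted_plain, calls_plain = stream_translate(chunks, toy_full_model, never_wait)
emitted_early, calls_early = stream_translate(chunks, toy_full_model, toy_cheap_head)
print(emitted_plain, calls_plain)  # ['bonjour', 'le', 'monde'] 7
print(emitted_early, calls_early)  # ['bonjour', 'le', 'monde'] 5
```

In this toy run the early-exit head saves two full-model calls and produces the same output. That only holds because the toy head never skips a chunk at which a token first becomes available; a real head would have to be trained so that it rarely delays emission.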
In practice
- Train streaming translation models on automatic phrase-level alignments of input speech and target text.
- Use an early-exit wait policy to cut LLM calls, and thus energy use, on devices without sacrificing translation quality.
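As a rough illustration of the first point, the sketch below turns phrase-level alignments into an interleaved training sequence of audio-chunk placeholders, target tokens, and wait tokens. The summary does not specify the sequence layout, so everything here is an assumption: the `AlignedPhrase` type, the `build_intermixed_sequence` function, the `<audio_i>` and `<wait>` token names, the 0.5-second chunk size, and the rule that a target phrase is placed after the chunk in which its aligned audio ends.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

WAIT = "<wait>"  # hypothetical name for the wait token


@dataclass
class AlignedPhrase:
    audio_end_s: float  # time at which the source phrase ends in the audio
    target_text: str    # aligned target-language phrase


def build_intermixed_sequence(
    phrases: List[AlignedPhrase],
    audio_len_s: float,
    chunk_s: float = 0.5,
) -> List[str]:
    """Interleave audio-chunk placeholders with target tokens and wait tokens."""
    n_chunks = math.ceil(audio_len_s / chunk_s)
    by_chunk: Dict[int, List[str]] = {}
    for p in phrases:
        # Index of the chunk in which the phrase's audio ends, clamped to range.
        idx = math.ceil(p.audio_end_s / chunk_s) - 1
        idx = min(n_chunks - 1, max(0, idx))
        by_chunk.setdefault(idx, []).extend(p.target_text.split())
    seq: List[str] = []
    for i in range(n_chunks):
        seq.append(f"<audio_{i}>")       # placeholder for the chunk's audio
        seq.extend(by_chunk.get(i, []))  # target tokens unlocked by this chunk
        seq.append(WAIT)                 # request the next chunk
    return seq


phrases = [
    AlignedPhrase(audio_end_s=0.9, target_text="bonjour"),
    AlignedPhrase(audio_end_s=1.8, target_text="le monde"),
]
sequence = build_intermixed_sequence(phrases, audio_len_s=2.0)
print(sequence)
```

With this layout, a model trained on such sequences would see a wait token after every chunk, including chunks where no target phrase is complete yet. A real setup would likely split target text with the LLM's tokenizer rather than on whitespace and end the sequence with an end-of-sequence token; both are omitted here for brevity.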
Topics
- Streaming Speech-to-Text Translation
- SpeechLLM Architecture
- Adaptive Wait Policy
- Phrase-Level Alignment
- Energy Efficiency
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.