Streaming Speech-to-Text Translation with a SpeechLLM

2026-05-15 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Speech & Language Processing · Depth: Expert, extended

Summary

This paper introduces the "intermixed" SpeechLLM, an LLM-based architecture for true streaming speech-to-text translation. Unlike existing SpeechLLM systems, which either follow a fixed wait policy or wait for the complete utterance, the model learns to decide adaptively when it has received enough audio to emit the next output tokens. It is trained on automatic phrase-level alignments between input speech and target text, a method proposed to overcome the challenges that word-level alignment poses for dissimilar languages such as English and Korean. In experiments on English-to-French and English-to-Korean, the intermixed SpeechLLM reaches translation quality comparable to non-streaming baselines at a significantly lower latency of only 1-2 seconds. The paper also proposes an "early-exit wait policy" variant that reduces on-device energy consumption by minimizing LLM calls without compromising translation quality. Finally, it shows that the model is robust to acoustic variations such as prepended silence, which causes catastrophic failures in fixed-policy systems.

Key takeaway

Research scientists developing real-time speech-to-text translation systems should consider adopting the intermixed SpeechLLM architecture. Compared with fixed-policy systems, it offers a better latency-quality trade-off (1-2 seconds of latency with COMET scores comparable to non-streaming baselines) and is more robust to real-world acoustic variations. The early-exit wait policy can be added to reduce energy consumption in on-device deployment without sacrificing translation quality.

Key insights

A novel SpeechLLM adaptively decides when to emit translation tokens, achieving low latency and high quality.

Principles

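A learned, adaptive wait policy outperforms fixed wait policies in both latency-quality trade-off and robustness.
Phrase-level alignments are a workable training signal where word-level alignments break down, as for dissimilar language pairs such as English and Korean.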

Method

The "intermixed" SpeechLLM integrates a learned wait policy directly into the LLM: when the model needs more audio before it can continue translating, it outputs a wait token instead of a translation token. An "early-exit wait policy" variant further reduces energy consumption by using a faster, simpler head to decide whether the full LLM needs to be evaluated at all.
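As a rough illustration of the control flow described above, the following sketch shows a streaming loop in which the model either emits a translation token or a wait token, with an optional cheap gate standing in for the early-exit head. Everything here is an assumption made for illustration, not the paper's implementation: the token names `<wait>` and `<eos>`, the functions `stream_translate` and `make_toy_model`, the `gate` callback, and the toy model that simply releases a target token once a given number of audio chunks has arrived.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

WAIT = "<wait>"  # hypothetical token: "I need more audio"
EOS = "<eos>"    # hypothetical end-of-translation token


@dataclass
class StreamStats:
    llm_calls: int = 0   # number of full-LLM evaluations
    gate_calls: int = 0  # number of cheap early-exit evaluations


def stream_translate(
    chunks: Sequence[str],
    llm_step: Callable[[List[str], List[str], bool], str],
    gate: Optional[Callable[[List[str], List[str]], bool]] = None,
    max_tokens_per_chunk: int = 16,
) -> Tuple[List[str], StreamStats]:
    """Consume audio chunk by chunk; emit tokens until the model asks to wait."""
    stats = StreamStats()
    audio: List[str] = []
    output: List[str] = []
    for i, chunk in enumerate(chunks):
        audio.append(chunk)
        finished = i == len(chunks) - 1
        # Early exit: a cheap head may decide that the full LLM is not needed yet.
        if gate is not None and not finished:
            stats.gate_calls += 1
            if not gate(audio, output):
                continue
        for _ in range(max_tokens_per_chunk):
            stats.llm_calls += 1
            token = llm_step(audio, output, finished)
            if token == EOS:
                return output, stats
            if token == WAIT:
                break  # request the next audio chunk
            output.append(token)
    return output, stats


def make_toy_model(target: List[str], needed: List[int]):
    """Toy stand-in for the LLM: token k is released once needed[k] chunks arrived."""
    def llm_step(audio: List[str], output: List[str], finished: bool) -> str:
        k = len(output)
        if k == len(target):
            return EOS
        if finished or len(audio) >= needed[k]:
            return target[k]
        return WAIT
    return llm_step


if __name__ == "__main__":
    model = make_toy_model(["bon", "jour", "monde"], needed=[3, 4, 5])
    chunks = ["sil", "sil", "a1", "a2", "a3"]  # two chunks of prepended silence
    print(stream_translate(chunks, model))
    print(stream_translate(chunks, model, gate=lambda a, o: len(a) >= 3))
```

In this toy setup the gate skips the full model on the two silent chunks, so the translation is unchanged while the number of full-LLM calls drops. A real early-exit head would be a learned component rather than a fixed chunk count.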

In practice

Train streaming translation models on automatic phrase-level alignments of input speech and target text.
Use an early-exit wait policy to cut LLM calls, and with them energy use, on devices without sacrificing translation quality.
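One way to picture how phrase-level alignments could become a training target is sketched below: audio is split into fixed-length chunks, and each target phrase is placed after the chunk in which its aligned audio ends, with a wait token marking every point where more audio is requested. This is a minimal sketch under stated assumptions, not the paper's data pipeline. The function `build_intermixed_sequence`, the `<audio:i>` placeholders, the `<wait>` and `<eos>` tokens, the 0.5-second chunk size, and the example alignment times are all hypothetical.

```python
import math
from typing import List, Tuple

WAIT = "<wait>"  # hypothetical wait token
EOS = "<eos>"    # hypothetical end-of-translation token


def build_intermixed_sequence(
    phrases: List[Tuple[float, List[str]]],
    audio_duration_s: float,
    chunk_s: float = 0.5,
) -> List[str]:
    """Interleave audio-chunk placeholders with target phrases.

    phrases: (audio_end_time_in_seconds, target_tokens) pairs, sorted by time.
    A phrase is emitted after the first chunk that fully covers its audio.
    """
    n_chunks = max(1, math.ceil(audio_duration_s / chunk_s))
    sequence: List[str] = []
    p = 0
    for c in range(n_chunks):
        last = c == n_chunks - 1
        chunk_end = min((c + 1) * chunk_s, audio_duration_s)
        sequence.append(f"<audio:{c}>")
        # Emit every phrase whose aligned audio has been fully heard;
        # on the last chunk, flush whatever remains.
        while p < len(phrases) and (last or phrases[p][0] <= chunk_end + 1e-9):
            sequence.extend(phrases[p][1])
            p += 1
        # Ask for more audio after every chunk except the final one.
        sequence.append(EOS if last else WAIT)
    return sequence


if __name__ == "__main__":
    # Hypothetical alignment: "Bonjour" ends at 0.9 s, the rest at 1.8 s.
    aligned = [(0.9, ["Bonjour"]), (1.8, ["tout", "le", "monde"])]
    print(build_intermixed_sequence(aligned, audio_duration_s=2.0))
```

Working at phrase granularity means only the end time of each phrase has to be reliable, which is a weaker requirement than aligning every word across languages with different word order.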

Topics

Streaming Speech-to-Text Translation
SpeechLLM Architecture
Adaptive Wait Policy
Phrase-Level Alignment
Energy Efficiency

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.