AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task
Summary
AlignAtt4LLM is a simultaneous speech translation system developed for the IWSLT 2026 task, translating English into German, Italian, and Chinese. It is a synchronous cascade: Qwen3-ASR produces incrementally updated source transcripts via forced alignment, and Gemma-4 E4B-it translates these prefixes under an MT-side AlignAtt policy. This is presented as the first application of AlignAtt to a decoder-only LLM, which lacks the encoder-decoder cross-attention that prior AlignAtt systems rely on. The approach recovers a usable policy through four proposals: an explicit source span in the prompt, offline selection of translation-specific alignment heads, selective qk-fast replay of the draft-to-source attention block, and runtime query/key capture that preserves bit-identical model outputs. On the IWSLT 2026 development set, AlignAtt4LLM surpassed the supplied baselines for English to German and English to Italian in both the low-latency (around 2 seconds) and high-latency (below 4 seconds CU-LongYAAL) regimes; results for English to Chinese were mixed. Reapplying the method to another decoder-only MT backbone requires only a deterministic prompt layout, calibrated attention heads, and query/key capture.
Key takeaway
For NLP engineers developing simultaneous speech translation systems, AlignAtt4LLM demonstrates a viable way to adapt the attention-based AlignAtt policy to decoder-only LLMs such as Gemma-4. Consider defining the source span explicitly in the prompt and calibrating a selected set of attention heads to compensate for the missing encoder-decoder cross-attention. On the development set the approach beat the supplied baselines for English to German and English to Italian at both latency regimes, while English to Chinese results were mixed, so gains should be verified per language pair.
Key insights
Adapting AlignAtt to decoder-only LLMs enables simultaneous speech translation by reading source alignment from selected self-attention heads instead of encoder-decoder cross-attention.
Principles
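- An explicit source span in the prompt gives the policy a defined set of source positions to align against.
- Offline selection of translation-specific alignment heads helps recover a usable policy.
- Runtime query/key capture leaves the model outputs bit-identical.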
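As a rough illustration of what replaying a draft-to-source attention block from captured queries and keys could look like, here is a minimal pure-Python sketch. It is not the authors' qk-fast implementation: the function name `replay_source_attention` is hypothetical, it handles a single head, it assumes any positional transform has already been applied to the captured vectors, and it renormalizes over the source span only, which differs from the model's full softmax over all keys.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def replay_source_attention(
    draft_queries: Sequence[Vector], source_keys: Sequence[Vector]
) -> List[List[float]]:
    """Recompute attention of each draft query over the source keys only.

    Hypothetical helper: one head, scaled dot product, softmax restricted
    to the source span (an assumption, not the model's full softmax).
    """
    dim = len(source_keys[0])
    rows: List[List[float]] = []
    for query in draft_queries:
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in source_keys
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Toy captured vectors: one draft token, two source tokens.
block = replay_source_attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because the block is computed from copies of the queries and keys, the forward pass itself is not modified, which is consistent with the bit-identical-output property described above.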
Method
The system uses Qwen3-ASR for incremental transcription and Gemma-4 E4B-it for translation. The AlignAtt policy is applied through a prompt-based source span, selected attention heads, qk-fast replay, and runtime query/key capture.
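The core AlignAtt decision rule can be sketched as follows: a draft token is emitted only while its most-attended source position lies outside the newest few source positions, and emission stops at the first token that leans on them. This is a minimal sketch under stated assumptions, not the system's code: the function name `alignatt_emit_count`, the `frontier` parameter, and the plain averaging over heads are all illustrative choices.

```python
from typing import Sequence

# attn[head][draft_token][source_token] = attention weight
Attention = Sequence[Sequence[Sequence[float]]]


def alignatt_emit_count(attn: Attention, heads: Sequence[int], frontier: int) -> int:
    """Return how many leading draft tokens may be emitted.

    Hypothetical helper: averages the selected heads, then stops at the
    first draft token whose most-attended source position falls within
    the last `frontier` source positions.
    """
    n_draft = len(attn[heads[0]])
    n_source = len(attn[heads[0]][0])
    for t in range(n_draft):
        avg = [
            sum(attn[h][t][s] for h in heads) / len(heads)
            for s in range(n_source)
        ]
        most_attended = max(range(n_source), key=avg.__getitem__)
        if most_attended >= n_source - frontier:
            return t  # this token depends on unstable source; wait for more input
    return n_draft


# Toy example: one head, three draft tokens, four source tokens.
toy_attn = [[
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]]
emit = alignatt_emit_count(toy_attn, heads=[0], frontier=1)
```

A larger `frontier` makes the policy more conservative (fewer tokens emitted per step, higher latency), while a smaller one emits more eagerly.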
In practice
- Apply AlignAtt to Gemma-4 E4B-it for simultaneous MT.
- Use prompt engineering for source span definition.
- Select specific attention heads for translation tasks.
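One plausible way to select alignment heads offline is to rank each head by how often its most-attended source position matches a reference alignment, then keep the top few. The summary does not state the selection criterion actually used, so the agreement score, the function name `select_alignment_heads`, and the `(layer, head)` keys below are assumptions for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def select_alignment_heads(
    head_attn: Dict[Head, Sequence[Sequence[float]]],
    reference: Sequence[int],
    top_k: int,
) -> List[Head]:
    """Rank heads by argmax agreement with a reference alignment.

    Hypothetical helper: head_attn[head][t] is the attention row of target
    token t over source tokens; reference[t] is its aligned source index.
    """
    def agreement(rows: Sequence[Sequence[float]]) -> float:
        hits = sum(
            1
            for row, ref in zip(rows, reference)
            if max(range(len(row)), key=row.__getitem__) == ref
        )
        return hits / len(reference)

    ranked = sorted(head_attn, key=lambda h: agreement(head_attn[h]), reverse=True)
    return ranked[:top_k]


# Toy calibration data: two target tokens aligned to source positions 0 and 1.
toy_heads = {
    (0, 0): [[0.9, 0.1], [0.2, 0.8]],  # matches both alignments
    (0, 1): [[0.1, 0.9], [0.8, 0.2]],  # matches neither
    (1, 0): [[0.6, 0.4], [0.7, 0.3]],  # matches one
}
chosen = select_alignment_heads(toy_heads, reference=[0, 1], top_k=2)
```

In a real calibration run the agreement would be averaged over many sentence pairs rather than a single toy example.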
Topics
- Simultaneous Speech Translation
- Decoder-Only LLMs
- AlignAtt Policy
- Gemma-4 E4B-it
- Qwen3-ASR
- Low-Latency MT
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.