Kyutai Releases Hibiki-Zero: A 3B-Parameter Simultaneous Speech-to-Speech Translation Model Using GRPO Reinforcement Learning Without Any Word-Level Aligned Data
Summary
Kyutai has released Hibiki-Zero, a 3B-parameter, decoder-only model for simultaneous speech-to-speech (S2ST) and speech-to-text (S2TT) translation. It is built on a multistream RQ-Transformer architecture and the streaming Mimi audio codec, and it is trained without word-level aligned data. Hibiki-Zero jointly models source audio, target audio, and an "inner monologue" text stream at a 12.5 Hz frame rate. Training starts from coarse sentence-level alignments and is followed by reinforcement learning with Group Relative Policy Optimization (GRPO) using BLEU-based process rewards. This approach optimizes the balance between translation quality and latency, with strong reported results in accuracy, naturalness, and cross-lingual speaker similarity across five language tasks. The model can also be adapted to new languages, such as Italian, with fewer than 1,000 hours of data.
Key takeaway
For NLP engineers building real-time speech translation systems, Hibiki-Zero shows a way to reach high-quality, low-latency S2ST without word-level aligned data. Its GRPO-based reinforcement learning strategy and multistream RQ-Transformer architecture are worth investigating as a way to reduce data annotation effort and improve simultaneous translation performance.
Key insights
Hibiki-Zero enables simultaneous speech translation without word-level alignments using GRPO reinforcement learning.
Principles
- Jointly model audio and text streams.
- Optimize quality-latency trade-offs with RL.
Method
Use a multistream RQ-Transformer with the Mimi audio codec, train on coarse sentence-level alignments, then fine-tune with GRPO using BLEU-based process rewards.
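The group-relative step that gives GRPO its name can be illustrated with a minimal sketch. It assumes several candidate translations are sampled for the same source utterance and that each has already received a scalar reward. The reward values below are hypothetical placeholders, not Hibiki-Zero's BLEU-based process rewards, whose exact definition this summary does not give. The function name, the `eps` term, and the use of the population standard deviation are illustrative choices, not Kyutai's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    Samples that score above the group average get a positive advantage,
    samples below it get a negative one. No learned value function is used.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled translations of one utterance.
rewards = [0.2, 0.4, 0.4, 0.6]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

In a full training loop these advantages would weight the policy-gradient update for each sampled output; that part is omitted here because it depends on details the summary does not cover.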
In practice
- Apply GRPO for S2ST latency-quality optimization.
- Use the streaming Mimi codec for real-time audio processing.
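For real-time processing, it helps to see what the 12.5 Hz frame rate from the summary means in practice. The sketch below is plain arithmetic: it converts the frame rate into a per-frame duration and a frame count for a given stretch of audio. The helper names are hypothetical, and the 24 kHz sample rate in the usage lines is an assumption for illustration, not a figure stated in this summary.

```python
FRAME_RATE_HZ = 12.5  # frame rate stated in the summary


def frame_duration_ms(frame_rate_hz=FRAME_RATE_HZ):
    """Duration of audio covered by one frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Number of whole frames needed to cover `seconds` of audio."""
    return int(seconds * frame_rate_hz)


def samples_per_frame(sample_rate_hz, frame_rate_hz=FRAME_RATE_HZ):
    """Raw audio samples covered by one frame at the given sample rate."""
    return int(sample_rate_hz / frame_rate_hz)


print(frame_duration_ms())        # 80.0 ms of audio per frame
print(frames_for(10))             # 125 frames for 10 seconds
print(samples_per_frame(24_000))  # 1920, assuming 24 kHz input (not from the source)
```

Each modeling step therefore corresponds to 80 ms of audio, which sets the granularity at which a streaming system can consume source speech and emit translated speech.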
Topics
- Speech-to-Speech Translation
- Simultaneous Translation
- Reinforcement Learning
- GRPO
- RQ-Transformer
Best for: NLP Engineer, Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.