Kyutai Releases Hibiki-Zero: A 3B-Parameter Simultaneous Speech-to-Speech Translation Model Using GRPO Reinforcement Learning Without Any Word-Level Aligned Data

2026-02-13 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, quick

Summary

Kyutai has released Hibiki-Zero, a 3B-parameter, decoder-only model for simultaneous speech-to-speech (S2ST) and speech-to-text (S2TT) translation. The model removes the need for word-level aligned training data by combining a multistream RQ-Transformer architecture with the streaming Mimi audio codec. Hibiki-Zero jointly models source audio, target audio, and an "inner monologue" text stream at a 12.5 Hz frame rate. Its training pipeline starts from coarse sentence-level alignments, then applies a reinforcement learning stage that uses Group Relative Policy Optimization (GRPO) with BLEU-based process rewards. This stage tunes the trade-off between translation quality and latency, and the model reports strong results in accuracy, naturalness, and cross-lingual speaker similarity across five language tasks. The model can also be adapted to new languages, such as Italian, with fewer than 1,000 hours of data.

Key takeaway

For NLP engineers building real-time speech translation systems, Hibiki-Zero shows a way to reach high-quality, low-latency S2ST without extensive word-level aligned data. Its GRPO-based reinforcement learning strategy and multistream RQ-Transformer architecture are worth investigating as a way to reduce data annotation burdens and improve performance in simultaneous translation tasks.

Key insights

Hibiki-Zero enables simultaneous speech translation without word-level alignments using GRPO reinforcement learning.

Principles

Jointly model audio and text streams.
Optimize quality-latency trade-offs with RL.

Method

Use a multistream RQ-Transformer with the Mimi audio codec. Train first with coarse sentence-level alignments, then fine-tune using GRPO with BLEU-based process rewards.
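
To make the GRPO step more concrete, here is a minimal sketch of the group-relative normalization that gives the method its name: several outputs are sampled for the same input, each is scored, and each score is compared against the group's mean. This is the generic GRPO advantage computation, not Hibiki-Zero's actual training code. The function name, the reward values, and the use of one scalar reward per sample (rather than per-step process rewards) are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    Each advantage is (reward - group mean) / (group std + eps), so
    samples better than the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical BLEU-like rewards for four sampled translations of one input.
rewards = [0.20, 0.40, 0.40, 0.60]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the baseline is the group's own mean, no separate value network is needed. When every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.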

In practice

Apply GRPO for S2ST latency-quality optimization.
Use streaming Mimi codec for real-time audio processing.
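
For a sense of what the 12.5 Hz frame rate means for a streaming pipeline, the arithmetic below converts it into per-frame duration and buffer size. The 12.5 Hz figure comes from the summary. The 24 kHz sample rate is an assumption about the codec's input audio, and the helper name is hypothetical.

```python
import math

FRAME_RATE_HZ = 12.5      # frame rate stated in the summary
SAMPLE_RATE_HZ = 24_000   # assumption: input audio sample rate

# Duration of one frame in milliseconds.
frame_ms = 1000 / FRAME_RATE_HZ

# Number of raw audio samples covered by one frame.
samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def frames_for(seconds):
    """Frames needed to cover `seconds` of audio (hypothetical helper)."""
    return math.ceil(seconds * FRAME_RATE_HZ)


print(frame_ms)           # 80.0
print(samples_per_frame)  # 1920
print(frames_for(2.0))    # 25
```

Under these assumptions, each model step consumes and emits 80 ms of audio, which sets the granularity at which the model can decide to wait or to speak.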

Topics

Speech-to-Speech Translation
Simultaneous Translation
Reinforcement Learning
GRPO
RQ-Transformer

Code references

kyutai-labs/hibiki-zero

Best for: NLP Engineer, Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.