Kyutai Releases Hibiki-Zero: A 3B-Parameter Simultaneous Speech-to-Speech Translation Model Using GRPO Reinforcement Learning Without Any Word-Level Aligned Data

· Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Advanced, quick

Summary

Kyutai has released Hibiki-Zero, a 3B-parameter, decoder-only model for simultaneous speech-to-speech translation (S2ST) and speech-to-text translation (S2TT). The model removes the need for word-level aligned training data by combining a multistream RQ-Transformer architecture with the streaming Mimi audio codec. Hibiki-Zero jointly models three streams (source audio, target audio, and an "inner monologue" text stream) at a framerate of 12.5 Hz. Training starts from coarse sentence-level alignments and is followed by a reinforcement learning stage that applies Group Relative Policy Optimization (GRPO) with BLEU-based process rewards. This stage tunes the trade-off between translation quality and latency, and the model achieves strong results in accuracy, naturalness, and cross-lingual speaker similarity across five language tasks. The model can also be adapted to new languages, such as Italian, with fewer than 1,000 hours of data.
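To make the 12.5 Hz framerate concrete, the short sketch below works out how long one frame lasts and how many joint steps cover a given stretch of audio. Only the 12.5 Hz figure comes from the summary; the function names are hypothetical and the rounding choice (rounding up to whole frames) is an assumption made for illustration, not a description of how Hibiki-Zero handles partial frames.

```python
import math

# Framerate stated in the summary; everything else here is illustrative.
FRAME_RATE_HZ = 12.5


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of a single frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio.

    Assumption: a partially filled final frame still counts as one frame.
    """
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(frame_duration_ms())        # milliseconds per frame
    print(frames_for_duration(10.0))  # frames covering ten seconds of audio
```

At this framerate each step of the source, target, and text streams spans 80 ms of audio, so ten seconds of speech corresponds to 125 joint steps.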

Key takeaway

For NLP engineers building real-time speech translation systems, Hibiki-Zero offers a way to reach high-quality, low-latency S2ST without word-level aligned data. Its GRPO-based reinforcement learning strategy and multistream RQ-Transformer architecture are worth investigating as a means of reducing data annotation effort and improving performance on simultaneous translation tasks.

Key insights

Hibiki-Zero enables simultaneous speech translation without word-level alignments using GRPO reinforcement learning.

Principles

Method

Use a multistream RQ-Transformer with the Mimi audio codec, train it with coarse sentence-level alignments, then fine-tune it using GRPO with BLEU-based process rewards.
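The group-relative part of GRPO can be illustrated with a minimal sketch. In the general GRPO formulation, several outputs are sampled for the same input, each receives a scalar reward, and each output's advantage is its reward normalized by the mean and standard deviation of its group. The code below shows only that normalization step. The reward values are made-up placeholders standing in for BLEU-based scores, the function name is hypothetical, and the use of the population standard deviation is an assumption; none of this is taken from Kyutai's training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group.

    advantage_i = (reward_i - mean(group)) / (std(group) + eps)

    Assumption: population standard deviation; implementations differ
    on this detail. `eps` avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder rewards for four sampled translations of one source utterance.
    sampled_rewards = [0.2, 0.4, 0.6, 0.8]
    for reward, adv in zip(sampled_rewards, group_relative_advantages(sampled_rewards)):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Samples that score above their group's mean receive positive advantages and are reinforced, while those below it are discouraged, so no separately trained value model is required to provide a baseline.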

Best for: NLP Engineer, Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.