Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

2026-04-24 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Hierarchical Policy Optimization (HPO) is a novel post-training approach designed to enhance simultaneous speech translation (SST) models, particularly those initially trained on imperfect supervised fine-tuning (SFT) data. This method addresses the high computational overhead of using large language models (LLMs) for SST by reformulating it as a multi-turn dialogue task, which allows for full reuse of the LLM's key-value (KV) cache. HPO introduces a hierarchical reward system that balances translation quality and latency, ensuring quality is prioritized before optimizing for speed. Experiments on English to Chinese, German, and Japanese translation tasks, using the ACL 60/60 development set and RealSI, show HPO improves COMET scores by over +7 and MetricX scores by +1.25 at a 1.5-second latency, outperforming strong baselines like SFT-only and SeqPO-SiMT.

Key takeaway

For research scientists developing simultaneous speech translation systems, HPO offers a robust post-training methodology to significantly improve both translation quality and latency. By implementing the hierarchical reward structure and SEGALE segmentation, you can mitigate issues from imperfect SFT data, achieving superior performance on unbounded speech. Consider integrating this approach to refine your LLM-based SST models, especially when balancing accuracy and real-time performance is critical.

Key insights

HPO improves simultaneous speech translation by post-training models with a hierarchical reward that prioritizes quality over latency.

Principles

Prioritize translation quality before optimizing for latency.
Segment hypotheses into sentences for accurate reward computation.
Group normalization stabilizes training by balancing reward components.

Method

HPO adapts Group Relative Policy Optimization (GRPO) using a hierarchical reward that sets latency to maximum if quality is below a threshold, then averages and group-normalizes quality and latency scores for policy optimization.

In practice

Use MetricX as a quality reward for SST.
Employ SEGALE for robust sentence alignment.
Apply Attention Sink for unbounded speech streams.

Topics

Hierarchical Policy Optimization
Simultaneous Speech Translation
Large Language Models
Reinforcement Learning
Latency-Quality Trade-off

Code references

owaski/HPO

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.