Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech
Summary
Hierarchical Policy Optimization (HPO) is a post-training approach for simultaneous speech translation (SST) models, particularly those initially trained on imperfect supervised fine-tuning (SFT) data. It works within a formulation of SST as a multi-turn dialogue task, which addresses the high computational overhead of large language models (LLMs) by allowing full reuse of the key-value (KV) cache. HPO introduces a hierarchical reward that balances translation quality and latency, prioritizing quality before optimizing for speed. In experiments on English-to-Chinese, English-to-German, and English-to-Japanese translation using the ACL 60/60 development set and RealSI, HPO improves COMET by more than 7 points and MetricX by 1.25 points at a latency of 1.5 seconds, outperforming strong baselines such as SFT-only and SeqPO-SiMT.
Key takeaway
For research scientists developing simultaneous speech translation systems, HPO offers a robust post-training methodology to significantly improve both translation quality and latency. By implementing the hierarchical reward structure and SEGALE segmentation, you can mitigate issues from imperfect SFT data, achieving superior performance on unbounded speech. Consider integrating this approach to refine your LLM-based SST models, especially when balancing accuracy and real-time performance is critical.
Key insights
HPO improves simultaneous speech translation by post-training models with a hierarchical reward that prioritizes quality over latency.
Principles
- Prioritize translation quality before optimizing for latency.
- Segment hypotheses into sentences for accurate reward computation.
- Group normalization stabilizes training by balancing reward components.
Method
HPO adapts Group Relative Policy Optimization (GRPO) with a hierarchical reward: when a translation's quality falls below a threshold, its latency is set to the maximum (worst) value, so speed cannot compensate for poor quality. The quality and latency scores are then averaged and group-normalized for policy optimization.
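The reward described above can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the threshold, and the `max_latency` value are hypothetical, and the sketch assumes that quality scores are oriented so that higher is better (for example, a negated MetricX score) and that each component is normalized within the group before the two are averaged. The summary does not state the exact order of those two steps.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Z-score a list of values within one group of rollouts."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def hierarchical_rewards(qualities, latencies, quality_threshold, max_latency):
    """Combine quality and latency into one reward per rollout.

    Assumptions (not taken from the paper): higher quality is better,
    lower latency is better, and both inputs cover the same group of
    rollouts sampled for a single source utterance.
    """
    # Quality gate: a rollout below the threshold is treated as if it
    # had the worst possible latency, so it cannot win by being fast.
    gated = [
        lat if q >= quality_threshold else max_latency
        for q, lat in zip(qualities, latencies)
    ]
    q_norm = group_normalize(qualities)
    # Negate latency so that a larger normalized value means "better".
    l_norm = group_normalize([-lat for lat in gated])
    # Average the two normalized components into a single reward.
    return [(q + l) / 2 for q, l in zip(q_norm, l_norm)]


# Four hypothetical rollouts; the third is fast but low quality.
rewards = hierarchical_rewards(
    qualities=[0.9, 0.8, 0.3, 0.85],
    latencies=[1.0, 2.0, 0.5, 1.5],
    quality_threshold=0.5,
    max_latency=4.0,
)
```

In this toy group, the third rollout has the lowest raw latency but still receives the lowest reward, because the gate replaces its latency with the maximum before normalization.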
In practice
- Use MetricX as a quality reward for SST.
- Employ SEGALE for robust sentence alignment.
- Apply Attention Sink for unbounded speech streams.
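The last point can be sketched as a cache eviction rule. This assumes the common formulation of Attention Sink, in which the first few cached entries are always kept and the rest of the cache is a sliding window over the most recent entries. The function name and parameters are hypothetical, the cache is modeled as a plain list, and position re-indexing is omitted. How the paper configures this for speech input is not stated in this summary.

```python
def evict_with_attention_sink(cache, num_sink, window):
    """Return the cache entries to keep under a sink-plus-window policy.

    cache    : list of cached entries, oldest first (stand-ins for KV pairs)
    num_sink : number of initial entries that are never evicted
    window   : number of most recent entries to keep
    """
    if len(cache) <= num_sink + window:
        # Nothing to evict yet.
        return list(cache)
    sinks = cache[:num_sink]
    # Slice from an explicit start index so that window == 0 works.
    recent = cache[len(cache) - window:]
    return list(sinks) + list(recent)


# Ten cached positions, keeping 2 sink entries and a window of 3.
kept = evict_with_attention_sink(list(range(10)), num_sink=2, window=3)
```

Because the kept cache never exceeds `num_sink + window` entries, memory stays bounded however long the input stream runs.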
Topics
- Hierarchical Policy Optimization
- Simultaneous Speech Translation
- Large Language Models
- Reinforcement Learning
- Latency-Quality Trade-off
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.