Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning
Summary
Translate-R1 is a reinforcement learning (RL) policy from Amazon Stores Foundation AI that teaches a large language model (LLM) to decide when calling a translation tool is worth its cost. The policy is trained from the post-trained Qwen3-4B model on 22 languages spanning 3 resource tiers and 5 domains, using confidence-gated GSPO for cost-sensitive tool use. It lifts reward by +4.6, +23.5, and +17.5 on the High, Low, and XLow resource tiers, respectively. Compared to an unconstrained policy, Translate-R1 preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. It also improves by +18.7 on two synthetic, completely unseen languages and transfers zero-shot to 9 held-out languages. The training data comes from an answer-preserving translation pipeline validated at 98.4% fidelity.
Key takeaway
For AI scientists and ML engineers deploying multilingual LLMs, a confidence-gated RL policy like Translate-R1 lets the model call a translation tool adaptively instead of always or never. This cuts unnecessary tool cost on high-resource languages while keeping the performance gains on low-resource and unseen languages. Consider this method when you need a Pareto-optimal cost-performance trade-off in a multilingual application.
Key insights
A learned policy allows LLMs to introspect their comprehension and invoke translation tools only when necessary, balancing performance and cost.
Principles
- LLMs tend to be overconfident in their own comprehension, so they underuse translation tools on low-resource languages.
- Cost-sensitive tool use needs an adaptive mechanism so that the cost penalty does not over-suppress tool calls the model actually needs.
- Reward-based learning fosters language- and domain-adaptive introspection.
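The trade-off behind these principles can be illustrated with a small expected-reward calculation. This is a minimal sketch, not the paper's reward: the function names, the linear penalty `lam * tool_cost`, and all accuracy numbers are hypothetical and chosen only to show why a cost-aware policy should skip the tool where it adds little and keep it where it adds a lot.

```python
def expected_reward(p_correct: float, used_tool: bool,
                    lam: float, tool_cost: float = 1.0) -> float:
    """Expected shaped reward: accuracy minus a cost penalty for a tool call.

    ASSUMPTION: a linear penalty lam * tool_cost, for illustration only.
    """
    penalty = lam * tool_cost if used_tool else 0.0
    return p_correct - penalty


def should_call_tool(p_no_tool: float, p_with_tool: float,
                     lam: float, tool_cost: float = 1.0) -> bool:
    """Call the tool only if its accuracy gain outweighs its penalty."""
    with_tool = expected_reward(p_with_tool, True, lam, tool_cost)
    without_tool = expected_reward(p_no_tool, False, lam, tool_cost)
    return with_tool > without_tool


# Hypothetical accuracies, not results from the paper.
high_resource = should_call_tool(p_no_tool=0.80, p_with_tool=0.82, lam=0.05)
low_resource = should_call_tool(p_no_tool=0.40, p_with_tool=0.65, lam=0.05)
print(high_resource, low_resource)
```

With these made-up numbers the tool is skipped for the high-resource case (a 0.02 gain does not cover a 0.05 penalty) and kept for the low-resource case. Raising `lam` far enough would suppress the low-resource call as well, which is the over-suppression risk named in the list above.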
Method
Continue RL on Qwen3-4B using confidence-gated GSPO, leveraging an answer-preserving translation pipeline for multilingual verifiable rewards.
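For readers unfamiliar with GSPO (Group Sequence Policy Optimization), the sketch below shows the standard objective as we understand its published formulation: a length-normalized, sequence-level importance ratio, clipped and weighted by group-normalized advantages. It does not model the confidence gate, which this summary does not specify. The function names and the `clip_eps` default are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_token_logps, old_token_logps):
    """Sequence-level importance ratio, normalized by response length."""
    n = len(new_token_logps)
    return math.exp((sum(new_token_logps) - sum(old_token_logps)) / n)


def gspo_objective(new_logps, old_logps, rewards, clip_eps=0.2):
    """Clipped surrogate objective averaged over the group (to be maximized).

    new_logps / old_logps: one list of per-token log-probs per response.
    clip_eps is a hypothetical value, not taken from the paper.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for new, old, adv in zip(new_logps, old_logps, advantages):
        s = sequence_ratio(new, old)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(s * adv, s_clipped * adv)
    return total / len(rewards)


# Toy group of two responses; rewards could be a verifiable correctness
# score minus a tool-cost term (an assumption, see lead-in).
old = [[-1.0, -2.0], [-1.5, -0.5, -1.0]]
new = [[-0.9, -1.8], [-1.6, -0.6, -1.1]]
print(gspo_objective(new, old, rewards=[1.0, 0.0]))
```

Working at the sequence level means one ratio per response rather than one per token, which matches a reward that is assigned to the whole response.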
In practice
- Implement confidence-gated GSPO for efficient, cost-aware LLM tool use.
- Use back-translation to verify that answers survive translation when building multilingual data for RL with verifiable rewards (RLVR).
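A back-translation check of this kind could look roughly like the sketch below. It is a toy illustration under stated assumptions: the `forward` and `backward` translators are hypothetical stand-ins (simple dictionaries here, a real translation system in practice), and exact match after normalization is only one possible equivalence test. The paper's actual pipeline is not described in this summary.

```python
from typing import Callable, Iterable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def answer_preserved(answer: str,
                     forward: Callable[[str], str],
                     backward: Callable[[str], str]) -> bool:
    """True if the answer survives a round trip through translation."""
    round_trip = backward(forward(answer))
    return normalize(round_trip) == normalize(answer)


def fidelity(answers: Iterable[str],
             forward: Callable[[str], str],
             backward: Callable[[str], str]) -> float:
    """Fraction of answers preserved by the round trip."""
    answers = list(answers)
    kept = sum(answer_preserved(a, forward, backward) for a in answers)
    return kept / len(answers)


# Hypothetical toy translators; "blua" maps back incorrectly on purpose.
EN_TO_XX = {"paris": "parizo", "four": "kvar", "blue": "blua"}
XX_TO_EN = {"parizo": "paris", "kvar": "four", "blua": "green"}


def forward(text: str) -> str:
    return EN_TO_XX.get(text, text)


def backward(text: str) -> str:
    return XX_TO_EN.get(text, text)


print(fidelity(["paris", "four", "blue", "42"], forward, backward))
```

Samples that fail the round trip would be dropped or re-translated, so that the reference answer used as the verifiable reward still matches the translated question.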
Topics
- Reinforcement Learning
- Multilingual LLMs
- Tool Use
- Cost Optimization
- Qwen3-4B
- GSPO
- Language Adaptation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.