Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

Translate-R1 is a reinforcement learning (RL) policy from Amazon Stores Foundation AI that teaches a large language model (LLM) to decide when calling a translation tool is worth its cost, across languages and domains. It continues RL from the post-trained Qwen3-4B model on 22 languages in 3 resource tiers and 5 domains, using confidence-gated GSPO for cost-sensitive tool use. Reported reward lifts are +4.6 on the High, +23.5 on the Low, and +17.5 on the XLow resource tier. Compared with an unconstrained policy, Translate-R1 preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. It also improves reward by +18.7 on two synthetic, completely unseen languages and transfers zero-shot to 9 held-out languages. The multilingual verifiable rewards rely on an answer-preserving translation pipeline with 98.4% fidelity.

Key takeaway

For AI scientists and ML engineers deploying multilingual LLMs, a confidence-gated RL policy like Translate-R1 is worth evaluating. It lets a model call a translation tool adaptively: unnecessary tool cost falls on high-resource languages, while the gains on low-resource and unseen languages are kept. The reported outcome is a Pareto-optimal cost-performance trade-off across most of the cost-sensitivity range, so tool spend drops without a loss in reward.
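To make "Pareto-optimal" concrete, the sketch below filters a set of candidate policies down to those that no other policy beats on both cost and reward. The policy names and numbers are hypothetical placeholders for illustration, not results from the paper.

```python
from typing import List, Tuple

# (policy name, relative tool cost, reward); all values are made up.
Point = Tuple[str, float, float]


def pareto_front(points: List[Point]) -> List[str]:
    """Return the names of policies that no other policy dominates.

    A policy is dominated when another one is at least as cheap and at
    least as rewarding, and strictly better on one of the two axes.
    """
    front = []
    for name, cost, reward in points:
        dominated = any(
            (c <= cost and r >= reward) and (c < cost or r > reward)
            for _, c, r in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical operating points, not figures from the paper.
policies = [
    ("never_translate", 0.0, 60.0),
    ("always_translate", 1.0, 75.0),
    ("gated", 0.6, 75.0),
    ("random_half", 0.5, 62.0),
]
print(pareto_front(policies))
```

In this toy set, always_translate drops out because the gated policy reaches the same reward at lower cost. That is the kind of comparison behind a claim of preserving full reward at a fraction of the cost.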

Key insights

A learned policy allows LLMs to introspect their comprehension and invoke translation tools only when necessary, balancing performance and cost.

Principles

LLMs are overconfident, underutilizing translation tools for low-resource languages.
Cost-sensitive tool use requires adaptive mechanisms to prevent over-suppression.
Reward-based learning fosters language- and domain-adaptive introspection.
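One way to picture how a cost penalty can avoid over-suppression is a shaped reward that charges for the tool only when the model was already confident without it. This is a minimal sketch of one possible reading, not the paper's formulation: the gate rule, the `confidence` signal, `cost_weight`, and `gate_threshold` are all hypothetical.

```python
def shaped_reward(
    task_reward: float,
    used_tool: bool,
    confidence: float,
    cost_weight: float = 0.2,      # hypothetical price of one tool call
    gate_threshold: float = 0.7,   # hypothetical confidence cutoff
) -> float:
    """Subtract the tool cost only when no-tool confidence is high.

    Assumption: `confidence` estimates how likely the model is to answer
    correctly without translation. With a flat penalty, tool calls on
    low-resource inputs would be punished too, which is the
    over-suppression failure mode this gate is meant to illustrate.
    """
    if used_tool and confidence >= gate_threshold:
        return task_reward - cost_weight
    return task_reward


# High-resource prompt: the model was confident, so the call is charged.
print(shaped_reward(1.0, used_tool=True, confidence=0.9))
# Low-resource prompt: the model was unsure, so the call is free.
print(shaped_reward(1.0, used_tool=True, confidence=0.3))
```

Under this reading the penalty pushes the policy away from translation only where translation was unlikely to help, which matches the principle that cost sensitivity needs an adaptive mechanism.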

Method

Continue RL on Qwen3-4B using confidence-gated GSPO, leveraging an answer-preserving translation pipeline for multilingual verifiable rewards.
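The sketch below shows the shape of a GSPO-style update signal, assuming GSPO here refers to Group Sequence Policy Optimization: rewards are normalized within a group of sampled responses, and clipping is applied to a length-normalized, sequence-level importance ratio. It is a stdlib-only illustration on plain lists of token log-probabilities. The function names and the `clip_eps` default are hypothetical, and the confidence gate from the paper is not modeled.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_logps: List[float], old_logps: List[float]) -> float:
    """Length-normalized sequence-level importance ratio."""
    diff = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(diff / len(new_logps))


def gspo_objective(
    new_group: List[List[float]],
    old_group: List[List[float]],
    rewards: List[float],
    clip_eps: float = 0.2,  # hypothetical value, chosen for readability
) -> float:
    """Clipped surrogate objective averaged over the group."""
    total = 0.0
    advantages = group_advantages(rewards)
    for new_lp, old_lp, adv in zip(new_group, old_group, advantages):
        s = sequence_ratio(new_lp, old_lp)
        clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(s * adv, clipped * adv)
    return total / len(rewards)


# Two sampled responses for one prompt; log-probabilities are made up.
old = [[-1.0, -2.0], [-0.5, -0.5, -1.0]]
new = [[-0.9, -1.8], [-0.6, -0.7, -1.2]]
print(gspo_objective(new, old, rewards=[1.0, 0.0]))
```

In a real training loop the log-probabilities would come from the current and previous policy and the objective would be maximized with gradient ascent. Here the rewards would be the multilingual verifiable rewards described above.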

In practice

Implement confidence-gated GSPO for efficient, cost-aware LLM tool use.
Use back-translation for robust answer verification in multilingual RL with verifiable rewards (RLVR).
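A round-trip check of this kind can be sketched as follows: translate an item, translate it back, and confirm that the answer-bearing content survived. The paper's pipeline is not specified here, so this is an illustration only. The translators are toy stand-ins passed in as callables, and comparing the numbers in the text is an assumed, simplified proxy for "answer-preserving".

```python
import re
from typing import Callable, List

Translator = Callable[[str], str]


def extract_numbers(text: str) -> List[str]:
    """Collect numeric tokens, the part of the item the answer depends on."""
    return sorted(re.findall(r"\d+(?:\.\d+)?", text))


def answer_preserved(source: str, forward: Translator, backward: Translator) -> bool:
    """True if the numbers in the source survive a translation round trip."""
    round_trip = backward(forward(source))
    return extract_numbers(round_trip) == extract_numbers(source)


def fidelity(items: List[str], forward: Translator, backward: Translator) -> float:
    """Fraction of items whose answer-bearing content is preserved."""
    kept = sum(answer_preserved(item, forward, backward) for item in items)
    return kept / len(items)


# Toy stand-ins for real translation systems (hypothetical).
def toy_forward(text: str) -> str:
    return text.replace("apples", "pommes")


def toy_backward(text: str) -> str:
    return text.replace("pommes", "apples")


def lossy_forward(text: str) -> str:
    # Simulates a translator that drops digits.
    return re.sub(r"\d+", "some", text)


items = ["Ana has 3 apples and buys 4 more.", "A train travels 120 km in 2 hours."]
print(fidelity(items, toy_forward, toy_backward))
print(fidelity(items, lossy_forward, toy_backward))
```

Items that fail the check would be filtered out or re-translated before they are used as RL training prompts, so that the reference answer stays valid in the target language.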

Topics

Reinforcement Learning
Multilingual LLMs
Tool Use
Cost Optimization
Qwen3-4B
GSPO
Language Adaptation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.