Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Summary
A new research paper introduces a semantic-space alignment paradigm for extending large language models (LLMs) to low-resource languages without incurring a significant "alignment tax." This tax, typically seen with supervised fine-tuning (SFT), is the catastrophic forgetting of general capabilities that accompanies improved performance in a target language. Instead of maximizing token-level likelihood, the proposed method uses Group Relative Policy Optimization (GRPO) with embedding-level semantic rewards and a language consistency reward. Evaluated on Tibetan–Chinese machine translation and Tibetan headline generation using the Qwen3-4B model with LoRA, the approach preserves general competence (e.g., on the Chinese CMRC benchmark) better than SFT. Although it sometimes shows lower n-gram overlap with references, the semantic RL model achieves higher semantic quality and user preference in open-ended generation, and learns representations that transfer better to few-shot tasks.
Key takeaway
AI Engineers and Research Scientists developing LLMs for low-resource languages should consider a semantic-space alignment strategy based on reinforcement learning. This approach, exemplified by GRPO with embedding-level semantic rewards, can reduce catastrophic forgetting of general capabilities (the "alignment tax") compared to traditional supervised fine-tuning. The reported results suggest that models trained this way achieve higher semantic quality and transfer better to new tasks, even when reference-based metrics show lower surface-level overlap.
Key insights
Semantic-space alignment with reinforcement learning mitigates "alignment tax" in low-resource language expansion.
Principles
- Prioritize meaning preservation over surface-form imitation.
- Constrained policy updates reduce catastrophic forgetting.
- Semantic rewards enable flexible linguistic realizations.
Method
A two-stage training paradigm: a cold-start SFT stage first gives the model minimal competence in the target language, then GRPO optimizes for meaning preservation using an embedding-level semantic similarity reward and a language consistency reward.
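To make the GRPO stage more concrete, here is a minimal sketch of the group-relative step: several completions are sampled for one prompt, each is scored, and the scores are normalized within that group to obtain advantages. The helper names, the additive way the two rewards are combined, and the `lang_weight` value are assumptions for illustration; the summary does not state how the paper weights its rewards.

```python
from statistics import mean, pstdev


def combined_reward(semantic_reward, language_ok, lang_weight=0.5):
    """Combine a semantic reward with a language consistency bonus.

    Assumption: a simple weighted sum. The weighting is hypothetical.
    """
    return semantic_reward + lang_weight * (1.0 if language_ok else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is judged relative to its siblings for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt: (semantic reward, language check)
samples = [(0.9, True), (0.2, True), (0.7, False), (0.5, True)]
rewards = [combined_reward(s, ok) for s, ok in samples]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive positive advantages and are reinforced; those below are discouraged. If every completion in a group scores the same, all advantages are zero and the group contributes no gradient signal.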
In practice
- Use multilingual sentence embedding models for semantic rewards.
- Apply a threshold-and-rescale function to semantic rewards.
- Incorporate rule-based language consistency checks.
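The three practices above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the linear rescale above a threshold `tau`, the default values of `tau` and `min_ratio`, and the use of the Tibetan Unicode block (U+0F00 to U+0FFF) as the language rule are all assumptions. The embeddings are plain lists of floats here; in practice they would come from a multilingual sentence embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_and_rescale(sim, tau=0.5):
    """Zero out similarities below tau, then stretch [tau, 1] onto [0, 1].

    Assumption: a linear rescale with 0 <= tau < 1; the paper's exact
    function is not given in this summary.
    """
    if sim < tau:
        return 0.0
    return (sim - tau) / (1.0 - tau)


def tibetan_ratio(text):
    """Fraction of non-whitespace characters in the Tibetan Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if "\u0f00" <= c <= "\u0fff")
    return tibetan / len(chars)


def language_consistency_reward(text, min_ratio=0.8):
    """Rule-based check: 1.0 if the output is mostly Tibetan script, else 0.0."""
    return 1.0 if tibetan_ratio(text) >= min_ratio else 0.0


def semantic_reward(output_embedding, reference_embedding, tau=0.5):
    """Embedding-level reward: thresholded, rescaled cosine similarity."""
    sim = cosine_similarity(output_embedding, reference_embedding)
    return threshold_and_rescale(sim, tau)
```

Thresholding keeps weakly related outputs from earning partial credit, and rescaling spreads the remaining range so that differences among good outputs produce a stronger signal. The script check guards against a failure mode that a purely semantic reward cannot catch: an output in the wrong language that is still close to the reference in embedding space.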
Topics
- Reinforcement Learning
- Semantic Rewards
- Low-Resource Language Expansion
- Alignment Tax Mitigation
- Group Relative Policy Optimization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.