Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

2025-12-25 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

A new research paper introduces a semantic-space alignment paradigm for extending large language models (LLMs) to low-resource languages without incurring a significant "alignment tax." The alignment tax, typically seen with supervised fine-tuning (SFT), is the catastrophic forgetting of general capabilities that accompanies gains in a target language. Instead of token-level likelihood maximization, the proposed method uses Group Relative Policy Optimization (GRPO) with an embedding-level semantic reward and a language consistency reward. Evaluated on Tibetan–Chinese machine translation and Tibetan headline generation using the Qwen3-4B model with LoRA, the approach preserves general competence (e.g., on the Chinese CMRC benchmark) better than SFT. Although its outputs sometimes show lower n-gram overlap with references, the semantic RL model achieves higher semantic quality, is preferred by users in open-ended generation, and learns representations that transfer better to few-shot tasks.

Key takeaway

AI engineers and research scientists developing LLMs for low-resource languages should consider a semantic-space alignment strategy based on reinforcement learning. This approach, exemplified by GRPO with embedding-level semantic rewards, can substantially reduce catastrophic forgetting of general capabilities (the "alignment tax") compared to traditional supervised fine-tuning. The paper reports higher semantic quality and better transfer to new tasks, even where reference-based metrics show lower surface-level overlap.

Key insights

Semantic-space alignment with reinforcement learning mitigates "alignment tax" in low-resource language expansion.

Principles

Prioritize meaning preservation over surface-form imitation.
Constrained policy updates reduce catastrophic forgetting.
Semantic rewards enable flexible linguistic realizations.

Method

A two-stage training paradigm: cold-start SFT for minimal competence, followed by GRPO with embedding-level semantic similarity and language consistency rewards for meaning preservation.
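
The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the commonly described formulation in which several outputs are sampled for the same prompt and each reward is normalized by the group's mean and standard deviation. The function name, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a sample
    is credited only for being better than its siblings, not for its raw score.
    Using the population standard deviation here is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical combined rewards for four sampled outputs of one prompt.
rewards = [0.9, 0.4, 0.7, 0.0]
advantages = group_relative_advantages(rewards)

# The best-scoring sample gets the largest positive advantage.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because the advantages are centered within the group, a prompt where every sample earns the same reward contributes no learning signal under this formulation.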

In practice

Use multilingual sentence embedding models for semantic rewards.
Apply a threshold-and-rescale function to semantic rewards.
Incorporate rule-based language consistency checks.
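
The three practices above might be combined into a single reward along the following lines. This is a hedged sketch under several assumptions: the summary does not give the threshold value, the exact rescaling function, the consistency rule, or how the two rewards are combined, so the linear rescale, the 0.5 threshold, the 0.8 script ratio, and the multiplicative combination are all hypothetical. Embeddings are passed in as plain lists of floats so that the sketch does not depend on any particular multilingual sentence embedding model, and the consistency check assumes Tibetan is the target language, using the Unicode Tibetan block (U+0F00 to U+0FFF).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def threshold_and_rescale(sim, threshold=0.5):
    """Assumed form: zero at or below the threshold, then a linear map of
    (threshold, 1] onto (0, 1]. The threshold value is hypothetical."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    if sim <= threshold:
        return 0.0
    return min(1.0, (sim - threshold) / (1.0 - threshold))


def tibetan_ratio(text):
    """Fraction of non-whitespace characters in the Unicode Tibetan block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for ch in chars if "\u0f00" <= ch <= "\u0fff")
    return tibetan / len(chars)


def language_consistency_reward(text, min_ratio=0.8):
    """Rule-based check: 1.0 if the output is mostly Tibetan script, else 0.0.
    The 0.8 cutoff is an illustrative assumption."""
    return 1.0 if tibetan_ratio(text) >= min_ratio else 0.0


def combined_reward(out_emb, ref_emb, out_text, threshold=0.5, min_ratio=0.8):
    """Multiply the two rewards (an assumption), so that an output in the
    wrong language earns nothing regardless of embedding similarity."""
    semantic = threshold_and_rescale(cosine_similarity(out_emb, ref_emb), threshold)
    return semantic * language_consistency_reward(out_text, min_ratio)


# Toy 2-dimensional embeddings stand in for real sentence embeddings.
print(combined_reward([1.0, 0.0], [1.0, 0.0], "བོད་ཡིག"))
print(combined_reward([1.0, 0.0], [1.0, 0.0], "hello"))
```

Thresholding keeps weakly related outputs from collecting partial credit, and the script check guards against the policy scoring well on embedding similarity while answering in the wrong language.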

Topics

Reinforcement Learning
Semantic Rewards
Low-Resource Language Expansion
Alignment Tax Mitigation
Group Relative Policy Optimization

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.