Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning
Summary
This research introduces PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. The dataset targets cross-lingual factual inconsistency in large language models (LLMs) trained primarily on English data. The study compares light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) as ways to enhance cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. The findings indicate that GRPO consistently outperforms SFT, significantly improving both cross-lingual consistency and generalization to previously unseen languages, while CPT on parallel data offers limited additional benefit. Mechanistic analyses show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, fostering more shared cross-lingual representations.
Key takeaway
For Machine Learning Engineers developing multilingual LLMs, this research suggests prioritizing reinforcement learning approaches like GRPO over traditional supervised fine-tuning. Consider GRPO when the goal is better cross-lingual factual recall and consistency, especially if the model must generalize to languages unseen during training. The released PolyFact dataset can also serve as a benchmark for evaluating your model's multilingual capabilities.
Key insights
GRPO significantly improves cross-lingual factual recall and consistency in LLMs by fostering shared representations, outperforming SFT.
Principles
- GRPO improves cross-lingual consistency.
- Shared representations reduce language specialization.
- SFT is less effective than GRPO.
Method
The study compares three training approaches on the PolyFact dataset: light continual pretraining (CPT), supervised fine-tuning (SFT), and Group Relative Policy Optimization (GRPO). Each is evaluated on how much it improves cross-lingual factual recall and consistency in Qwen-2.5-7B and OLMo-2-1124-7B.
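To make the "group relative" part of GRPO concrete, here is a minimal sketch of how rewards for several sampled answers to one prompt can be turned into advantages by normalizing within the group. The reward used in the paper is not described in this summary, so `exact_match_reward` is a hypothetical stand-in, and the use of the population standard deviation and the `eps` value are assumptions for illustration only.

```python
from statistics import mean, pstdev


def exact_match_reward(answer: str, gold: str) -> float:
    """Hypothetical reward (an assumption, not the paper's): 1.0 on a
    case-insensitive, whitespace-trimmed exact match, else 0.0."""
    return 1.0 if answer.strip().casefold() == gold.strip().casefold() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and standard deviation of its
    own sampling group, so no separate value network is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for the same question (toy data).
samples = ["Paris", "Lyon", "paris ", "Marseille"]
rewards = [exact_match_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average receive a positive advantage and the rest a negative one. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.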
In practice
- Apply GRPO for multilingual LLM fine-tuning.
- Utilize PolyFact for cross-lingual QA tasks.
- Prioritize shared cross-lingual representations.
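When evaluating on a parallel dataset such as PolyFact, one simple way to quantify cross-lingual consistency is to measure how much the sets of correctly answered facts overlap between language pairs. The sketch below uses a Jaccard-style overlap; this is an illustrative assumption, not necessarily the metric the paper reports, and the function name and toy data are hypothetical.

```python
from itertools import combinations


def pairwise_consistency(correct_by_lang: dict[str, list[bool]]) -> dict[tuple[str, str], float]:
    """For each language pair, return the share of facts answered correctly
    in both languages among facts answered correctly in at least one.

    correct_by_lang maps a language code to per-fact correctness flags,
    aligned by fact index (the same fact sits at the same position).
    """
    scores = {}
    for a, b in combinations(sorted(correct_by_lang), 2):
        pairs = list(zip(correct_by_lang[a], correct_by_lang[b]))
        both = sum(1 for x, y in pairs if x and y)
        either = sum(1 for x, y in pairs if x or y)
        scores[(a, b)] = both / either if either else 0.0
    return scores


# Toy correctness flags for four parallel facts (made-up data).
results = {
    "en": [True, True, True, False],
    "de": [True, True, False, False],
    "sw": [True, False, False, False],
}
scores = pairwise_consistency(results)
```

A score near 1.0 means the model knows the same facts in both languages; a low score means its recall depends on the language of the question, even when per-language accuracy looks acceptable.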
Topics
- Large Language Models
- Cross-Lingual QA
- Reinforcement Learning
- GRPO
- PolyFact Dataset
- Factual Consistency
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.