Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning
Summary
Researchers introduced PolyFact, a large-scale parallel multilingual factual QA dataset comprising 100,000 Wikidata-grounded facts across 12 typologically diverse languages. They used it to evaluate methods for improving cross-lingual factual recall in large language models such as Qwen-2.5-7B and OLMo-2-1124-7B. Consistency-driven reinforcement learning via Group Relative Policy Optimization (GRPO) consistently outperformed supervised fine-tuning (SFT), improving both cross-lingual consistency and generalization to unseen languages. Light continual pretraining (CPT) on parallel data yielded limited additional gains. Mechanistic analyses showed that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, which promotes shared cross-lingual representations rather than surface-level memorization. The code, models, and dataset are open-sourced.
Key takeaway
AI scientists and machine learning engineers developing multilingual LLMs should prioritize consistency-driven reinforcement learning (GRPO) over supervised fine-tuning to improve cross-lingual factual recall and generalization. According to the paper, GRPO restructures internal representations so that knowledge is accessible across languages, whereas SFT tends toward surface-level memorization. Evaluate models on free-form generation tasks such as KLAR to confirm genuine cross-lingual retrieval rather than mere candidate selection.
Key insights
Consistency-driven reinforcement learning improves LLM cross-lingual factual recall by promoting shared internal representations.
Principles
- Cross-lingual inconsistency stems from language transition failures, not missing knowledge.
- GRPO fosters generalizable cross-lingual behavior, unlike SFT's surface-level memorization.
- GRPO delays linguistic specialization, preserving a larger language-agnostic space.
Method
PolyFact, a 100K-fact, 12-language Wikidata-grounded QA dataset, supplies the parallel questions needed for consistency-driven RL. Training uses GRPO with grouped rollouts, and the reward includes a bonus when the answer is correct in all languages.
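The reward shaping and group-relative scoring described here can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the bonus value of 0.5, the use of per-language accuracy as the base reward, and the assumption that each rollout yields one answer per language are all hypothetical. Only the standardization of rewards within a group follows the usual GRPO recipe.

```python
from statistics import mean, pstdev


def consistency_reward(correct_by_lang, bonus=0.5):
    """Base reward = fraction of languages answered correctly.

    A bonus is added only when every language is correct. The bonus
    size (0.5) is an illustrative assumption, not a value from the paper.
    """
    if not correct_by_lang:
        raise ValueError("need at least one language")
    accuracy = sum(correct_by_lang.values()) / len(correct_by_lang)
    all_correct = all(correct_by_lang.values())
    return accuracy + (bonus if all_correct else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style).

    Each rollout is scored relative to the group mean, so no learned
    value function is needed. eps avoids division by zero when all
    rollouts in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for one fact asked in three languages.
rollouts = [
    {"en": True, "de": True, "sw": True},    # correct everywhere: earns the bonus
    {"en": True, "de": False, "sw": False},
    {"en": True, "de": True, "sw": False},
    {"en": False, "de": False, "sw": False},
]
rewards = [consistency_reward(r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy group, only the first rollout receives the bonus, so it ends up with the largest advantage and the policy update would favor it over rollouts that are correct in just some languages.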
In practice
- Use GRPO for post-training to enhance multilingual factual recall.
- Prioritize consistency-driven RL over SFT for cross-lingual generalization.
- Consider dataset biases, especially for proper nouns, when evaluating multilingual models.
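When comparing post-training methods, it can help to report consistency separately from per-language accuracy. The sketch below shows one simple way to do that; the metric definition (share of facts answered correctly in every language among facts answered correctly in at least one) and the `consistency_report` helper are assumptions for illustration, not the paper's evaluation protocol.

```python
def consistency_report(results):
    """Summarize per-language accuracy and cross-lingual consistency.

    results maps a fact id to {language: was_the_answer_correct}.
    Consistency here is the share of facts answered correctly in ALL
    languages among facts answered correctly in AT LEAST ONE language,
    which separates "known but not retrieved" from "not known at all".
    """
    if not results:
        raise ValueError("results must contain at least one fact")
    langs = sorted({lang for per_lang in results.values() for lang in per_lang})
    n_facts = len(results)
    # A language missing for a fact is counted as incorrect.
    accuracy = {
        lang: sum(r.get(lang, False) for r in results.values()) / n_facts
        for lang in langs
    }
    known_somewhere = [r for r in results.values() if any(r.values())]
    known_everywhere = [
        r for r in known_somewhere if all(r.get(lang, False) for lang in langs)
    ]
    consistency = (
        len(known_everywhere) / len(known_somewhere) if known_somewhere else 0.0
    )
    return {"accuracy": accuracy, "consistency": consistency}


# Hypothetical correctness judgments for three facts in two languages.
example = {
    "Q1": {"en": True, "de": True},
    "Q2": {"en": True, "de": False},
    "Q3": {"en": False, "de": False},
}
report = consistency_report(example)
```

A model can raise per-language accuracy while leaving this ratio flat, which would suggest memorized answers in individual languages rather than shared retrieval.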
Topics
- Cross-Lingual Factual Recall
- Reinforcement Learning
- Group Relative Policy Optimization
- Multilingual LLMs
- PolyFact Dataset
- Mechanistic Interpretability
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.