Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, extended

Summary

Researchers introduced PolyFact, a large-scale parallel multilingual factual QA dataset comprising 100,000 Wikidata-grounded facts across 12 typologically diverse languages. This dataset was used to evaluate methods for improving cross-lingual factual recall in large language models like Qwen-2.5-7B and OLMo-2-1124-7B. Consistency-driven reinforcement learning via Group Relative Policy Optimization (GRPO) consistently outperformed supervised fine-tuning (SFT), enhancing both cross-lingual consistency and generalization to unseen languages. Light continual pretraining (CPT) on parallel data yielded limited additional gains. Mechanistic analyses revealed that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, promoting shared cross-lingual representations instead of surface-level memorization. The code, models, and dataset are open-sourced.

Key takeaway

AI scientists and machine learning engineers developing multilingual LLMs should prioritize consistency-driven reinforcement learning (GRPO) over supervised fine-tuning to improve cross-lingual factual recall and generalization. According to the mechanistic analyses, GRPO restructures internal representations for better knowledge access across languages, whereas SFT tends toward surface-level memorization. Evaluate models on free-form generation tasks such as KLAR to test genuine cross-lingual retrieval rather than candidate selection.

Key insights

Consistency-driven reinforcement learning improves LLM cross-lingual factual recall by promoting shared internal representations.

Principles

Cross-lingual inconsistency stems from language transition failures, not missing knowledge.
GRPO fosters generalizable cross-lingual behavior, unlike SFT's surface-level memorization.
GRPO delays linguistic specialization, preserving a larger language-agnostic space.
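
The first principle separates facts a model does not know from facts it knows but fails to retrieve in some languages. A minimal sketch of how one might measure that split is below. The function name, the input layout (fact ID mapped to per-language correctness flags), and the metric definition are illustrative assumptions, not the paper's evaluation protocol.

```python
def cross_lingual_consistency(results):
    """Share of 'known' facts that are answered correctly in every language.

    results: dict mapping fact_id -> dict mapping language code -> bool.
    A fact counts as 'known' if it is correct in at least one language
    (an assumption made for this sketch).
    """
    # Keep only facts the model gets right somewhere; failures on these
    # are retrieval failures across languages, not missing knowledge.
    known = [flags for flags in results.values() if any(flags.values())]
    if not known:
        return 0.0
    fully_consistent = sum(1 for flags in known if all(flags.values()))
    return fully_consistent / len(known)


# Hypothetical toy data: three facts scored in three languages.
toy_results = {
    "fact_a": {"en": True, "de": True, "ja": True},     # consistent
    "fact_b": {"en": True, "de": False, "ja": False},   # known, inconsistent
    "fact_c": {"en": False, "de": False, "ja": False},  # not known at all
}
score = cross_lingual_consistency(toy_results)
```

Facts that are wrong in every language are excluded from the denominator, so the score reflects transfer between languages rather than overall accuracy.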

Method

PolyFact, a 100K-fact, 12-language Wikidata-grounded QA dataset, enables consistency-driven RL via GRPO. GRPO uses grouped rollouts with a reward bonus for all-language correctness.
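
The reward and advantage computation described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the base reward (mean per-language correctness), the bonus size, and the assumption that each scored rollout carries one correctness flag per language are all hypothetical. The advantage step follows the usual GRPO recipe of normalizing rewards within a group.

```python
from statistics import mean, pstdev


def consistency_reward(correct_by_lang, bonus=1.0):
    """Reward for one rollout, given a non-empty dict of language -> bool.

    Base reward is the fraction of languages answered correctly; the
    bonus is added only when every language is correct. Both the base
    term and the bonus value are assumptions made for this sketch.
    """
    base = mean(1.0 if ok else 0.0 for ok in correct_by_lang.values())
    all_correct = all(correct_by_lang.values())
    return base + (bonus if all_correct else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rollouts get the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for the same fact.
group = [
    {"en": True, "de": True},    # all languages correct -> bonus applies
    {"en": True, "de": False},
    {"en": False, "de": True},
    {"en": False, "de": False},
]
rewards = [consistency_reward(g) for g in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group mean, the fully consistent rollout is pushed up and the others are pushed down, even when some of them are partially correct.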

In practice

Use GRPO for post-training to enhance multilingual factual recall.
Prioritize consistency-driven RL over SFT for cross-lingual generalization.
Consider dataset biases, especially for proper nouns, when evaluating multilingual models.

Topics

Cross-Lingual Factual Recall
Reinforcement Learning
Group Relative Policy Optimization
Multilingual LLMs
PolyFact Dataset
Mechanistic Interpretability

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.