Rethinking Cross-lingual Gaps from a Statistical Viewpoint
Summary
A study by Google DeepMind and Google Research re-examines the "cross-lingual gap" in Large Language Models (LLMs): the drop in accuracy when knowledge is queried in a target language rather than in its source language. Prior work attributes the gap to knowledge barriers or representation misalignment; this study instead hypothesizes that higher response variance in the target language is the dominant cause, and formalizes the claim with a bias-variance decomposition. Experiments on five LLMs (Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-5, GPT-5-mini, and Deepseek-R1) on the ECLeKTic and MMLU (with mixup) benchmarks provide evidence for the hypothesis. The study shows that inference-time interventions, such as response ensembling (sampling ten responses per example) and input ensembling (e.g., Translate-then-Answer), reduce the gap. A simple prompt instruction improved target accuracy by 20-25% across different models, further indicating that reducing response variance is key to mitigating cross-lingual performance disparities.
Key takeaway
For Machine Learning Engineers deploying multilingual LLMs: if you observe performance disparities across languages, try simple inference-time interventions before complex pretraining adjustments. Sample multiple outputs and ensemble them, apply input ensembling such as Translate-then-Answer, or add a simple prompt instruction to reduce response variance. In the study, a simple prompt instruction improved target-language accuracy by 20-25% across models, with no retraining required.
Key insights
LLM cross-lingual gaps stem from response variance, not knowledge barriers, and are reducible via variance control.
Principles
- Cross-lingual gaps are variance-driven, not knowledge-driven.
- Source and target response variance are proportional.
- High source confidence reduces cross-lingual gaps.
Method
Formalize cross-lingual gaps via bias-variance decomposition, then apply inference-time interventions like response or input ensembling to reduce target response variance.
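As a rough illustration of why variance control helps, the standard textbook bias-variance decomposition for squared error is sketched below. This is an assumption-laden sketch, not necessarily the paper's exact formulation: here y is the correct answer encoded as a fixed vector, \hat{y} is one sampled response in the same encoding, and \bar{y}_n is the average of n independent, identically distributed samples (all three symbols are introduced only for this illustration).

```latex
% Expected squared error of a single sampled response
\mathbb{E}\big[\lVert \hat{y} - y \rVert^2\big]
  = \underbrace{\lVert \mathbb{E}[\hat{y}] - y \rVert^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\lVert \hat{y} - \mathbb{E}[\hat{y}] \rVert^2\big]}_{\text{variance}}

% Averaging n i.i.d. samples leaves the bias unchanged and divides the variance by n
\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i
\quad\Longrightarrow\quad
\mathbb{E}\big[\lVert \bar{y}_n - y \rVert^2\big]
  = \text{bias}^2 + \frac{\text{variance}}{n}
```

Under this reading, if the target-language error is mostly variance rather than bias, ensembling should close most of the gap; if it were mostly bias, ensembling would not help.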
In practice
- Sample multiple responses and ensemble them.
- Use prompt instructions for implicit ensembling.
- Focus on improving source language confidence.
Topics
- Cross-lingual Gaps
- Large Language Models
- Response Variance
- Bias-Variance Decomposition
- Inference-Time Interventions
- Prompt Engineering
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.