Rethinking Cross-lingual Gaps from a Statistical Viewpoint

Source: cs.CL updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

A study by Google DeepMind and Google Research re-examines the "cross-lingual gap" in Large Language Models (LLMs): the drop in accuracy when knowledge is queried in a target language rather than in its source language. Challenging prior explanations based on knowledge barriers or representation misalignment, the authors hypothesize that higher response variance in the target language is the dominant cause, and they formalize this using a bias-variance decomposition. Experiments across five LLMs (Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-5, GPT-5-mini, and DeepSeek-R1) on the ECLeKTic and MMLU (with mixup) benchmarks provide evidence for the hypothesis. The study shows that inference-time interventions, such as response ensembling (sampling ten responses per example) and input ensembling (e.g., Translate-then-Answer), reduce the gap. A simple prompt instruction improved target accuracy by 20-25% across models, further indicating that reducing response variance is key to mitigating cross-lingual performance disparities.
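For reference, the bias-variance decomposition mentioned above can be written in its standard squared-error form. This is the textbook identity, not necessarily the exact formulation used in the paper; the symbols $y$ (a one-hot vector for the correct answer) and $\hat{r}$ (the model's stochastic response, encoded the same way) are assumptions introduced here for illustration.

```latex
\mathbb{E}\big[\lVert \hat{r} - y \rVert^2\big]
  = \underbrace{\lVert \mathbb{E}[\hat{r}] - y \rVert^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\lVert \hat{r} - \mathbb{E}[\hat{r}] \rVert^2\big]}_{\text{variance}}
```

Read this way, the hypothesis is that the variance term, rather than the bias term, is what grows in the target language. Averaging $n$ independent responses leaves the bias term unchanged and divides the variance term by $n$, which is why ensembling would be expected to help.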

Key takeaway

For Machine Learning Engineers deploying multilingual LLMs: if you observe performance disparities across languages, consider simple inference-time interventions before costly pretraining adjustments. Implement response ensembling by sampling multiple outputs per query and aggregating them, or use input ensembling such as Translate-then-Answer, to reduce response variance. In the study, a simple prompt instruction alone improved target-language accuracy by 20-25% across models, with no retraining required.
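A minimal sketch of response ensembling by majority vote, assuming short answers that can be compared after simple normalization. The function name `ensemble_answer` and the `sample_fn` callable are hypothetical stand-ins for one stochastic LLM call; they are not from the paper or any specific API, and the paper's own aggregation rule may differ.

```python
from collections import Counter
from typing import Callable, List


def ensemble_answer(
    sample_fn: Callable[[str], str],
    question: str,
    n_samples: int = 10,
) -> str:
    """Return the most frequent answer among n_samples sampled responses.

    sample_fn is a hypothetical placeholder for a single sampled model call
    (temperature > 0) that returns a short answer string.
    """
    # Normalize lightly so "Paris" and "paris " count as the same answer.
    answers: List[str] = [
        sample_fn(question).strip().lower() for _ in range(n_samples)
    ]
    # most_common(1) returns the modal answer; on ties, the answer that
    # appeared first among the samples wins.
    return Counter(answers).most_common(1)[0][0]
```

The default of ten samples mirrors the "ten responses per example" setting mentioned in the summary. For free-form answers, exact-match voting is too strict, and some form of answer extraction or semantic grouping would be needed before counting.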

Key insights

LLM cross-lingual gaps appear to stem primarily from higher response variance in the target language, not from knowledge barriers, and can be reduced by controlling that variance.

Method

Formalize cross-lingual gaps via bias-variance decomposition, then apply inference-time interventions like response or input ensembling to reduce target response variance.
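The effect of ensembling on the two error terms can be illustrated with a small simulation. This is a toy model under stated assumptions, not the paper's experimental setup: each response is reduced to a scalar score equal to the true score plus a fixed bias plus Gaussian noise, and the noise levels used for "source" and "target" are made-up values chosen only for illustration.

```python
import random
import statistics


def simulate(noise_sd: float, n_ensemble: int, trials: int = 20000, seed: int = 0):
    """Estimate (bias^2, variance) of an n-sample averaged response score.

    Toy assumption: score = true_score + bias + Gaussian noise, and the
    ensemble prediction is the mean of n_ensemble independent scores.
    """
    rng = random.Random(seed)
    true_score, bias = 1.0, 0.1  # hypothetical values
    preds = []
    for _ in range(trials):
        samples = [
            true_score + bias + rng.gauss(0.0, noise_sd)
            for _ in range(n_ensemble)
        ]
        preds.append(sum(samples) / n_ensemble)
    mean_pred = statistics.fmean(preds)
    bias_sq = (mean_pred - true_score) ** 2
    variance = statistics.pvariance(preds, mu=mean_pred)
    return bias_sq, variance


if __name__ == "__main__":
    # Hypothetical noise levels: low for the source language, high for the target.
    for label, sd in [("source", 0.2), ("target", 0.6)]:
        for n in (1, 10):
            b2, var = simulate(sd, n)
            print(f"{label:6s} n={n:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

In this toy setting the squared bias stays near 0.01 regardless of ensemble size, while the variance shrinks roughly in proportion to 1/n. If the gap between source and target is mostly a variance difference, ensembling closes most of it; if it were mostly bias, ensembling would not help.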

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.