mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?
Summary
mmPISA-bench is a compact, high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). It contains 25 multiple-choice reasoning questions, each provided in official human translations across 43 languages and complemented by machine-translated versions, for a total of 2,150 data points. The researchers evaluated proprietary LLMs from two providers, OpenAI's GPT-5.1-2025-11-13 and Anthropic's Opus-4-5-20251101 and Haiku-4-5-20251001, across languages, reasoning effort levels, and translation types. The results indicate that modern LLMs reason effectively in all 43 languages, with accuracy comparable to that of human test-takers, although performance varies somewhat by language. Notably, machine-translated questions did not degrade accuracy, which suggests they are adequate for large-scale multilingual evaluations. The analysis also found that, because of token usage disparities, using LLMs in some languages is simultaneously more expensive and less accurate.
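The 2,150 figure is consistent with every question appearing in every language in both translation variants. The short sketch below simply enumerates that grid to check the arithmetic; the function name and the integer question and language indices are illustrative placeholders, not identifiers from the benchmark.

```python
from itertools import product

# Counts taken from the summary above.
N_QUESTIONS = 25
N_LANGUAGES = 43
TRANSLATION_TYPES = ("human", "machine")


def build_grid(n_questions=N_QUESTIONS,
               n_languages=N_LANGUAGES,
               translation_types=TRANSLATION_TYPES):
    """Enumerate every (question, language, translation type) combination.

    Assumption: each question exists in each language in both variants,
    which is what the reported total implies.
    """
    return list(product(range(n_questions), range(n_languages), translation_types))


grid = build_grid()
print(len(grid))  # 25 * 43 * 2 = 2150
```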
Key takeaway
AI scientists evaluating multilingual LLMs can treat machine-translated datasets as being as effective as human-translated ones for reasoning tasks, which enables broader language coverage. However, inference cost and accuracy vary significantly across languages and must be accounted for. Prioritize cost-aware evaluation and qualitative analysis, especially for non-English languages, to identify hidden cross-language reasoning patterns and optimize deployment strategies.
Key insights
LLMs reason effectively across all 43 languages, and machine translation does not degrade accuracy.
Principles
- LLM reasoning performance varies across languages.
- Machine-translated questions do not degrade accuracy.
- In some languages, higher inference cost coincides with lower accuracy, driven by token usage disparities.
Method
Evaluate LLMs on 25 PISA-derived multiple-choice questions in 43 languages, using both official human translations and machine translations, and analyze accuracy, token usage, and reasoning effort.
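One way to carry out the accuracy part of this analysis is sketched below: group graded responses by language and translation type, then compare the two variants per language. This is a minimal illustration and not the authors' evaluation code. The record fields (`language`, `translation`, `answer`, `predicted`) are assumed names for this example.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Return {group: accuracy}, where group is a tuple of the given record keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in keys)
        total[group] += 1
        # A response counts as correct when the chosen option matches the key.
        correct[group] += int(r["predicted"] == r["answer"])
    return {g: correct[g] / total[g] for g in total}


def translation_gap(records):
    """Per-language accuracy difference: machine translation minus human translation."""
    acc = accuracy_by_group(records, ("language", "translation"))
    languages = {lang for lang, _ in acc}
    return {
        lang: acc[(lang, "machine")] - acc[(lang, "human")]
        for lang in sorted(languages)
        # Only compare languages for which both variants were evaluated.
        if (lang, "machine") in acc and (lang, "human") in acc
    }
```

A gap near zero for a language would be in line with the finding that machine translation does not degrade accuracy. With only 25 questions per language, individual gaps are noisy, so aggregate views are likely more informative than single-language comparisons.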
In practice
- Use machine translation for large-scale multilingual evaluations.
- Monitor token usage for cost-aware multilingual LLM deployment.
- Qualitatively inspect reasoning for cross-language behaviors.
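For the token-monitoring point, a rough sketch is shown below. It computes each language's token usage relative to a baseline language and the cost per correct answer. The function names, the choice of baseline, and the flat per-million-token price are assumptions made for illustration. Real pricing usually distinguishes input and output tokens.

```python
def token_premium(tokens_by_language, baseline="en"):
    """Return each language's token usage as a multiple of the baseline language.

    tokens_by_language maps a language code to the total tokens consumed
    on the same set of questions (hypothetical input format).
    """
    base = tokens_by_language[baseline]
    return {lang: n / base for lang, n in tokens_by_language.items()}


def cost_per_correct(total_tokens, price_per_million_tokens, n_correct):
    """Cost per correctly answered question, in the price's currency.

    Assumes a single flat price per million tokens for simplicity.
    """
    if n_correct == 0:
        return float("inf")  # no correct answers: cost per correct is unbounded
    return total_tokens / 1_000_000 * price_per_million_tokens / n_correct
```

Tracking both numbers together helps surface languages that are simultaneously more expensive and less accurate, since cost per correct answer rises when token usage grows or accuracy falls.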
Topics
- Multilingual LLMs
- Reasoning Benchmarks
- PISA Assessment
- Machine Translation
- Tokenization Efficiency
- Inference Cost
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.