mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?
Summary
mmPISA-bench is a new compact, high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). It contains 25 multiple-choice reasoning questions, each available as an official human translation in 43 languages and as a machine-translated version, for a total of 2,150 data points (25 questions × 43 languages × 2 translation sources). Evaluating two mainstream proprietary LLMs on the benchmark showed that modern LLMs can reason effectively in all 43 languages, with accuracy comparable to human test-takers, although performance varies somewhat by language. A key finding is that machine-translated questions do not degrade accuracy relative to human translations, which suggests that synthetic data can be adequate for large-scale multilingual evaluation. The analysis also found that, for certain languages, LLM usage is both more expensive and less accurate.
Key takeaway
NLP engineers developing multilingual LLM applications should consider using high-quality machine translation to build large-scale evaluation datasets, since it proved comparable to human translation for reasoning tasks. LLMs reason well across languages, but accuracy and inference cost can vary significantly by language, so benchmark and optimize for each specific target market.
Key insights
LLMs demonstrate effective multilingual reasoning across 43 languages, with machine translations proving as effective as human ones.
Principles
- LLMs reason effectively across diverse languages.
- Machine translation can yield high-quality evaluation data.
- Performance and cost vary significantly by language.
Method
mmPISA-bench uses 25 PISA-derived multiple-choice questions, translated into 43 languages (human and machine), to evaluate LLM reasoning capabilities and cost.
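The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the record schema (`language`, `source`, `predicted`, `gold`) and the function names are assumptions made for this example, and only the counts 25, 43, and 2 come from the summary.

```python
from collections import defaultdict

# Counts taken from the summary: 25 questions, 43 languages,
# and 2 translation sources (human and machine).
N_QUESTIONS = 25
N_LANGUAGES = 43
N_SOURCES = 2


def expected_size():
    """Number of data points implied by the benchmark design."""
    return N_QUESTIONS * N_LANGUAGES * N_SOURCES


def accuracy_by_group(records):
    """Accuracy per (language, source) pair.

    `records` is an iterable of dicts with the hypothetical keys
    'language', 'source' ('human' or 'machine'), 'predicted', 'gold'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["language"], r["source"])
        total[key] += 1
        correct[key] += int(r["predicted"] == r["gold"])
    return {key: correct[key] / total[key] for key in total}


def translation_gap(acc):
    """Machine-minus-human accuracy for each language that has both sources."""
    langs = {lang for lang, _ in acc}
    return {
        lang: acc[(lang, "machine")] - acc[(lang, "human")]
        for lang in langs
        if (lang, "machine") in acc and (lang, "human") in acc
    }
```

A gap near zero for a language would be consistent with the reported finding that machine translation does not degrade accuracy; with only 25 questions per language, small differences should be read with caution.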
In practice
- Use machine translation for multilingual evaluation data.
- Benchmark LLM performance across diverse languages.
- Monitor inference cost variations per language.
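One way to act on the last two points is to track, per language, how many tokens the same prompt consumes relative to a baseline language, and to flag languages that are both costlier and less accurate. The sketch below assumes you already have per-language token counts and accuracies from your own measurements; the function names, the `en` baseline, and the language codes in the usage example are hypothetical, and the numbers are made-up placeholders rather than results from the benchmark.

```python
def relative_cost(token_counts, baseline="en"):
    """Token usage of each language relative to the baseline language.

    `token_counts` maps a language code to the token count measured for
    the same content in that language (measured by you, not computed here).
    """
    base = token_counts[baseline]
    return {lang: count / base for lang, count in token_counts.items()}


def flag_costly_and_weak(token_counts, accuracy, baseline="en"):
    """Languages that use more tokens AND score lower than the baseline."""
    rel = relative_cost(token_counts, baseline)
    return sorted(
        lang
        for lang in rel
        if lang in accuracy
        and rel[lang] > 1.0
        and accuracy[lang] < accuracy[baseline]
    )


# Usage with placeholder values (not benchmark results):
tokens = {"en": 100, "xx": 150, "yy": 90}
scores = {"en": 0.9, "xx": 0.8, "yy": 0.95}
flagged = flag_costly_and_weak(tokens, scores)  # languages worth a closer look
```

Token count is only a proxy for cost; actual pricing depends on the provider and on output length, so treat the ratio as a first-pass signal.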
Topics
- Multilingual LLMs
- Reasoning Benchmarks
- mmPISA-bench
- Machine Translation Quality
- LLM Inference Cost
- Cross-lingual Evaluation
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.