mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

2026-06-05 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

mmPISA-bench is a compact, high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). It contains 25 multiple-choice reasoning questions, each available in official human translations across 43 languages and complemented by machine-translated versions, for a total of 2,150 data points. An evaluation of two mainstream proprietary LLMs on the benchmark showed that they reason effectively in all 43 languages, with accuracy comparable to that of human test-takers, although performance varies somewhat by language. A key finding is that machine-translated questions do not degrade accuracy relative to human translations, which suggests that synthetic data can be adequate for large-scale multilingual evaluation. The analysis also found that, for certain languages, LLM usage is both more expensive and less accurate.

Key takeaway

NLP engineers building multilingual LLM applications should consider using high-quality machine translation to create large-scale evaluation datasets, since it proved comparable to human translation for reasoning tasks. Although LLMs reason well across languages, accuracy and inference cost can vary significantly by language, so benchmark and optimize carefully for each target market.

Key insights

LLMs demonstrate effective multilingual reasoning across 43 languages, with machine translations proving as effective as human ones.

Principles

LLMs reason effectively across diverse languages.
Machine translation can yield high-quality evaluation data.
Performance and cost vary significantly by language.

Method

mmPISA-bench uses 25 PISA-derived multiple-choice questions, translated into 43 languages (human and machine), to evaluate LLM reasoning capabilities and cost.
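
The reported total of 2,150 data points is consistent with one record per question, language, and translation source (25 × 43 × 2). The sketch below enumerates such a grid; it is an illustration under that assumption, not the benchmark's actual data layout. The field names (question_id, language_index, source) and the function build_grid are hypothetical.

```python
from itertools import product

# Counts taken from the summary; the record layout itself is an assumption.
NUM_QUESTIONS = 25
NUM_LANGUAGES = 43
TRANSLATION_SOURCES = ("human", "machine")


def build_grid(num_questions=NUM_QUESTIONS,
               num_languages=NUM_LANGUAGES,
               sources=TRANSLATION_SOURCES):
    """Enumerate one record per (question, language, translation source)."""
    return [
        {"question_id": q, "language_index": lang, "source": src}
        for q, lang, src in product(range(num_questions),
                                    range(num_languages),
                                    sources)
    ]


grid = build_grid()
print(len(grid))  # 25 * 43 * 2 = 2150
```

Keeping the translation source as an explicit field makes it straightforward to compare human and machine translations of the same question later on.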

In practice

Use machine translation for multilingual evaluation data.
Benchmark LLM performance across diverse languages.
Monitor inference cost variations per language.
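
The three practices above can be combined in a small aggregation step. The following is a minimal sketch, assuming each evaluation result is a dictionary with hypothetical keys language, source, predicted, answer, and tokens (token count used as a rough proxy for inference cost). The sample records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate accuracy and mean token usage per (language, source)."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "tokens": 0})
    for r in records:
        key = (r["language"], r["source"])
        totals[key]["n"] += 1
        totals[key]["correct"] += int(r["predicted"] == r["answer"])
        totals[key]["tokens"] += r["tokens"]
    return {
        key: {"accuracy": t["correct"] / t["n"],
              "mean_tokens": t["tokens"] / t["n"]}
        for key, t in totals.items()
    }


def flag_costly_and_weak(summary, baseline_key):
    """Return keys that are both less accurate and costlier than the baseline."""
    base = summary[baseline_key]
    return sorted(
        key for key, s in summary.items()
        if s["accuracy"] < base["accuracy"]
        and s["mean_tokens"] > base["mean_tokens"]
    )


# Made-up records; "xx" is a placeholder language code, not a real result.
records = [
    {"language": "en", "source": "human", "predicted": "A", "answer": "A", "tokens": 100},
    {"language": "en", "source": "human", "predicted": "B", "answer": "B", "tokens": 120},
    {"language": "xx", "source": "human", "predicted": "A", "answer": "B", "tokens": 200},
    {"language": "xx", "source": "human", "predicted": "C", "answer": "C", "tokens": 220},
]

summary = summarize(records)
flagged = flag_costly_and_weak(summary, baseline_key=("en", "human"))
print(flagged)
```

Grouping by both language and translation source also allows a direct comparison of human-translated and machine-translated accuracy for the same language.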

Topics

Multilingual LLMs
Reasoning Benchmarks
mmPISA-bench
Machine Translation Quality
LLM Inference Cost
Cross-lingual Evaluation

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Scientist, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.