MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
Summary
Researchers introduce Math-PT, a benchmark dataset for evaluating large language models (LLMs) on complex mathematical reasoning in European and Brazilian Portuguese. The dataset comprises 1,729 problems curated from high-quality native sources, such as mathematical Olympiads and exams from Portugal and Brazil, covering primary to pre-university levels. Existing benchmarks are predominantly in English or translated, and Math-PT addresses this linguistic bias with natively written problems. An evaluation of 13 frontier and open-source LLMs on Math-PT found that top-tier models such as GPT-5 perform strongly on multiple-choice questions but are less accurate on open-ended questions and on questions involving figures. The dataset and model outputs are publicly released to foster further research in Portuguese mathematical reasoning.
Key takeaway
For research scientists developing or evaluating LLMs for non-English markets, Math-PT highlights a critical need for native language benchmarks. You should consider the observed performance drops in open-ended and figure-dependent questions as key areas for model improvement, especially when targeting Portuguese-speaking users. Prioritize developing multimodal reasoning capabilities and robust open-ended answer generation for better real-world applicability.
Key insights
Math-PT is the first native Portuguese benchmark for LLM mathematical reasoning, revealing performance gaps in open-ended and visual problems.
Principles
- Linguistic bias affects LLM mathematical proficiency evaluation.
- Frontier models outperform open-weight models in math reasoning.
- Visual elements and open-ended formats reduce LLM accuracy.
Method
Math-PT was created by curating 1,729 math problems from Olympiads and exams in Portugal and Brazil. LaTeX sources were converted to plain text, and gpt-5-mini was used to extract Brazilian Portuguese questions from PDFs into a structured JSON format.
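To make the idea of a structured JSON record concrete, here is a minimal Python sketch of what loading one problem might look like. The summary does not describe the actual schema, so every field name (`problem_id`, `variant`, `statement`, `choices`, `answer`, `has_figure`) and the sample problem itself are assumptions for illustration only.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical schema: these field names are illustrative assumptions,
# not the published Math-PT format.
@dataclass
class Problem:
    problem_id: str
    variant: str                  # "pt-PT" (European) or "pt-BR" (Brazilian)
    statement: str
    answer: str
    choices: Optional[List[str]]  # None for open-ended questions
    has_figure: bool

ALLOWED_VARIANTS = {"pt-PT", "pt-BR"}

def parse_problem(raw: str) -> Problem:
    """Parse one JSON record and reject unknown language variants."""
    data = json.loads(raw)
    if data["variant"] not in ALLOWED_VARIANTS:
        raise ValueError(f"unknown variant: {data['variant']}")
    return Problem(
        problem_id=data["problem_id"],
        variant=data["variant"],
        statement=data["statement"],
        answer=data["answer"],
        choices=data.get("choices"),
        has_figure=bool(data.get("has_figure", False)),
    )

# Made-up sample record, not taken from the dataset.
example = json.dumps({
    "problem_id": "demo-001",
    "variant": "pt-BR",
    "statement": "Quanto é 7 + 5?",
    "choices": ["10", "11", "12", "13"],
    "answer": "12",
    "has_figure": False,
})

problem = parse_problem(example)
is_multiple_choice = problem.choices is not None
```

Keeping `choices` optional and `has_figure` explicit lets the same record type cover the multiple-choice, open-ended, and figure-dependent cases that the evaluation distinguishes.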
In practice
- Use Math-PT to benchmark LLMs for Portuguese math tasks.
- Focus on improving LLM performance on math problems with figures.
- Develop strategies for open-ended math question answering.
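One way to act on these points is to report accuracy separately by question format and by whether a figure is involved, rather than as a single aggregate score. The Python sketch below assumes each graded result is a dict with hypothetical keys `format`, `has_figure`, and `correct`. The toy results are invented for illustration and are not numbers from the paper.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """Return accuracy for each (format, has_figure) slice.

    `results` is an iterable of dicts with the assumed keys
    'format', 'has_figure', and 'correct'.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r["format"], r["has_figure"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {key: hits[key] / totals[key] for key in totals}

# Invented toy data, only to show the shape of the output.
toy_results = [
    {"format": "multiple_choice", "has_figure": False, "correct": True},
    {"format": "multiple_choice", "has_figure": False, "correct": True},
    {"format": "multiple_choice", "has_figure": True, "correct": False},
    {"format": "open_ended", "has_figure": False, "correct": True},
    {"format": "open_ended", "has_figure": False, "correct": False},
]

scores = accuracy_by_slice(toy_results)
for (fmt, has_figure), acc in sorted(scores.items()):
    print(f"{fmt:16s} figure={has_figure!s:5s} accuracy={acc:.2f}")
```

Slicing this way makes a drop on open-ended or figure-dependent items visible even when overall accuracy looks strong.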
Topics
- Math-PT Benchmark
- Mathematical Reasoning
- Large Language Models
- Multilingual Benchmarking
- Portuguese Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.