MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
Summary
MATH-PT is a new benchmark dataset designed to evaluate large language models (LLMs) on complex mathematical reasoning in European and Brazilian Portuguese. This dataset comprises 1,729 mathematical problems sourced from high-quality native materials, including Olympiads, competitions, and exams from Portugal and Brazil, addressing the linguistic bias prevalent in existing English-centric benchmarks. A comprehensive evaluation of current state-of-the-art LLMs on MATH-PT reveals that frontier reasoning models perform strongly on multiple-choice questions. However, their performance significantly decreases when faced with questions containing figures or open-ended questions, indicating specific areas for improvement in multilingual mathematical reasoning. The dataset and model outputs are publicly released to support future research.
Key takeaway
For research scientists developing or evaluating LLMs for non-English markets, MATH-PT offers a critical tool to assess mathematical reasoning capabilities in Portuguese. You should prioritize testing models on questions involving figures and open-ended formats, as these represent significant performance gaps for current frontier models. Integrating MATH-PT into your evaluation pipeline will help identify specific areas for model improvement beyond multiple-choice accuracy.
Key insights
MATH-PT counters the English-centric bias of math reasoning benchmarks with 1,729 problems drawn from native European and Brazilian Portuguese sources.
Principles
- Native sources improve benchmark quality.
- Multilingual benchmarks reveal model limitations.
Method
The MATH-PT dataset was curated from mathematical Olympiads, competitions, and exams from Portugal and Brazil to create 1,729 problems in European and Brazilian Portuguese.
In practice
- Use MATH-PT for Portuguese LLM evaluation.
- Focus on figure-based and open-ended questions.
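One way to act on these two points is to report accuracy per question slice rather than as a single number. The sketch below is a minimal illustration, not the paper's evaluation code. It assumes each graded model output has been reduced to a record with three hypothetical fields (`format`, `has_figure`, `is_correct`); the released dataset may use different names and structure, and the toy records are invented purely to show the shape of the computation.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return accuracy for each (answer format, figure presence) slice.

    Each record is a dict with hypothetical keys:
      "format"     -> e.g. "multiple_choice" or "open_ended"
      "has_figure" -> True if the question includes a figure
      "is_correct" -> True if the model's answer was graded correct
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        # Slice key combines the answer format with figure presence.
        figure_tag = "figure" if record["has_figure"] else "text_only"
        key = (record["format"], figure_tag)
        totals[key] += 1
        correct[key] += int(record["is_correct"])
    return {key: correct[key] / totals[key] for key in totals}


# Toy records invented for illustration; these are not MATH-PT results.
toy_records = [
    {"format": "multiple_choice", "has_figure": False, "is_correct": True},
    {"format": "multiple_choice", "has_figure": False, "is_correct": True},
    {"format": "multiple_choice", "has_figure": True, "is_correct": False},
    {"format": "open_ended", "has_figure": False, "is_correct": True},
    {"format": "open_ended", "has_figure": False, "is_correct": False},
    {"format": "open_ended", "has_figure": True, "is_correct": False},
]

if __name__ == "__main__":
    for slice_key, acc in sorted(accuracy_by_slice(toy_records).items()):
        print(slice_key, f"{acc:.2f}")
```

Keeping the slices separate makes a drop on figure-based or open-ended questions visible even when the overall score is dominated by multiple-choice items.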
Topics
- Large Language Models
- Mathematical Reasoning
- Benchmark Datasets
- European Portuguese
- Brazilian Portuguese
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.