MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

Source: Paper Index on ACL Anthology · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

MATH-PT is a benchmark dataset for evaluating large language models (LLMs) on complex mathematical reasoning in European and Brazilian Portuguese. It comprises 1,729 problems drawn from high-quality native sources, including Olympiads, competitions, and exams from Portugal and Brazil, and is intended to counter the linguistic bias of existing English-centric benchmarks. An evaluation of current state-of-the-art LLMs on MATH-PT shows that frontier reasoning models perform strongly on multiple-choice questions, but their performance drops significantly on questions that contain figures and on open-ended questions. These two categories mark specific areas for improvement in multilingual mathematical reasoning. The dataset and model outputs are publicly released to support future research.

Key takeaway

For research scientists developing or evaluating LLMs for non-English markets, MATH-PT provides a way to assess mathematical reasoning in Portuguese. Prioritize testing models on questions that involve figures and on open-ended formats, since these are where current frontier models show significant performance gaps. Reporting MATH-PT results separately for each of these categories, rather than as a single score, helps identify where a model needs improvement beyond multiple-choice accuracy.
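A minimal sketch of the per-category breakdown suggested above, assuming model outputs have already been collected as a list of records. The field names (`format`, `has_figure`, `answer`, `prediction`) and the toy values are hypothetical and do not reflect the released MATH-PT schema or any reported results. Exact string match is used here as a stand-in for whatever answer-checking the benchmark actually applies.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records grouped by the value of `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = r[key]
        totals[group] += 1
        # Exact match after trimming whitespace: a simplification, not
        # necessarily the benchmark's own scoring rule.
        correct[group] += int(r["prediction"].strip() == r["answer"].strip())
    return {g: correct[g] / totals[g] for g in totals}


# Toy records with made-up values, only to show the expected input shape.
records = [
    {"format": "multiple_choice", "has_figure": False, "answer": "B", "prediction": "B"},
    {"format": "multiple_choice", "has_figure": True, "answer": "C", "prediction": "A"},
    {"format": "open_ended", "has_figure": False, "answer": "42", "prediction": "42"},
    {"format": "open_ended", "has_figure": True, "answer": "7", "prediction": "9"},
]

by_format = accuracy_by_group(records, "format")
by_figure = accuracy_by_group(records, "has_figure")
print(by_format)
print(by_figure)
```

Splitting on question format and on figure presence separately keeps the two effects apart. A single aggregate accuracy would hide whether a weak score comes from open-ended answers, from figure-dependent questions, or from both.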

Key insights

MATH-PT addresses linguistic bias in math reasoning benchmarks with 1,729 Portuguese problems.

Method

The MATH-PT dataset was curated from mathematical Olympiads, competitions, and exams held in Portugal and Brazil, yielding 1,729 problems in European and Brazilian Portuguese.

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.