MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

2026-04-30 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

Researchers introduce Math-PT, a benchmark dataset for evaluating large language models (LLMs) on complex mathematical reasoning in European and Brazilian Portuguese. The dataset comprises 1,729 problems curated from high-quality native sources, such as mathematical Olympiads and exams from Portugal and Brazil, covering primary to pre-university levels. Because existing math benchmarks are predominantly written in English or translated from it, Math-PT targets a significant linguistic bias in evaluation. An evaluation of 13 frontier and open-source LLMs on Math-PT found that top-tier models such as GPT-5 perform strongly on multiple-choice questions but lose accuracy on open-ended questions and on questions that involve figures. The dataset and model outputs are publicly released to support further research on mathematical reasoning in Portuguese.

Key takeaway

For research scientists developing or evaluating LLMs for non-English markets, Math-PT highlights a critical need for native language benchmarks. You should consider the observed performance drops in open-ended and figure-dependent questions as key areas for model improvement, especially when targeting Portuguese-speaking users. Prioritize developing multimodal reasoning capabilities and robust open-ended answer generation for better real-world applicability.

Key insights

Math-PT is the first native Portuguese benchmark for LLM mathematical reasoning, revealing performance gaps in open-ended and visual problems.

Principles

Linguistic bias affects LLM mathematical proficiency evaluation.
Frontier models outperform open-source models on Math-PT.
Visual elements and open-ended formats reduce LLM accuracy.

Method

Math-PT was built by curating 1,729 math problems from Olympiads and exams in Portugal and Brazil. LaTeX sources were converted to plain text, and gpt-5-mini was used to extract Brazilian Portuguese questions from PDFs into a structured JSON format.
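
To make the idea of a structured JSON record concrete, here is a minimal Python sketch of what one extracted problem could look like and how it might be validated. The field names (`problem_id`, `variant`, `question_type`, `has_figure`, and so on) and the `parse_problem` helper are assumptions made for illustration; the summary does not describe the dataset's actual schema, so check the released data before relying on any of them.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout: these field names are illustrative only
# and are not taken from the Math-PT release.
@dataclass
class MathProblem:
    problem_id: str
    variant: str          # "pt-PT" (European) or "pt-BR" (Brazilian)
    source: str           # e.g. the name of an Olympiad or exam
    statement: str        # problem text in plain text
    question_type: str    # "multiple_choice" or "open_ended"
    answer: str
    choices: List[str] = field(default_factory=list)
    has_figure: bool = False


def parse_problem(raw: str) -> MathProblem:
    """Parse one JSON string into a MathProblem and run basic sanity checks."""
    problem = MathProblem(**json.loads(raw))
    if problem.variant not in {"pt-PT", "pt-BR"}:
        raise ValueError(f"unknown language variant: {problem.variant!r}")
    if problem.question_type not in {"multiple_choice", "open_ended"}:
        raise ValueError(f"unknown question type: {problem.question_type!r}")
    if problem.question_type == "multiple_choice" and len(problem.choices) < 2:
        raise ValueError("multiple-choice problems need at least two choices")
    return problem
```

Validating each record at load time is one way to catch extraction mistakes early, for example a multiple-choice question whose options were dropped during PDF extraction.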

In practice

Use Math-PT to benchmark LLMs for Portuguese math tasks.
Focus on improving LLM performance on math problems with figures.
Develop strategies for open-ended math question answering.
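
One way to act on these points is to report accuracy per slice instead of a single aggregate, so that gaps on open-ended and figure-dependent questions stay visible. The sketch below assumes each graded result is a dictionary with hypothetical keys `question_type`, `has_figure`, and `correct`; neither the keys nor the `accuracy_by_slice` helper come from the Math-PT release, and the sample rows are toy values, not reported results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Slice = Tuple[str, bool]  # (question_type, has_figure)


def accuracy_by_slice(results: Iterable[dict]) -> Dict[Slice, float]:
    """Return accuracy for each (question_type, has_figure) combination."""
    totals: Dict[Slice, int] = defaultdict(int)
    correct: Dict[Slice, int] = defaultdict(int)
    for row in results:
        key = (row["question_type"], row["has_figure"])
        totals[key] += 1
        correct[key] += int(row["correct"])
    return {key: correct[key] / totals[key] for key in totals}


# Toy graded outputs for illustration only.
toy_results = [
    {"question_type": "multiple_choice", "has_figure": False, "correct": True},
    {"question_type": "multiple_choice", "has_figure": False, "correct": True},
    {"question_type": "open_ended", "has_figure": False, "correct": True},
    {"question_type": "open_ended", "has_figure": False, "correct": False},
    {"question_type": "open_ended", "has_figure": True, "correct": False},
]

for slice_key, acc in sorted(accuracy_by_slice(toy_results).items()):
    print(slice_key, f"{acc:.2f}")
```

Grading open-ended answers is the harder part and is left out here; this sketch assumes a `correct` flag has already been assigned by whatever answer-matching procedure you choose.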

Topics

Math-PT Benchmark
Mathematical Reasoning
Large Language Models
Multilingual Benchmarking
Portuguese Language Models

Code references

deep-spin/math-benchmark

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.