Benchmarks in Leipzig

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI for Mathematical Reasoning · Depth: Expert, quick

Summary

A group of 49 mathematicians, collaborating between April 1 and May 15, 2026, primarily during the "Benchmarks in Leipzig" workshop at the Max Planck Institute, compiled a dataset of 100 research-level mathematics questions. The questions were then used to evaluate state-of-the-art LLMs in three stages. In the first stage, five LLMs attempted each question once, leaving 41 questions unsolved. In the second, three LLMs were given 20 runs per model, which reduced the unsolved questions to 16. In the third, two "heavy-thinking" models performed 3 runs, leaving only 2 questions unsolved. Across all three stages, 98 of the 100 questions were solved at least once, a measure of how far the mathematical reasoning of current LLMs has advanced.

Key takeaway

For research scientists evaluating LLM capabilities in complex mathematical reasoning, this study indicates that current models, taken together and given repeated attempts, can solve nearly all of a set of research-level problems (98 of 100 here). That figure is coverage across many runs, not single-attempt accuracy: one attempt per model left 41 questions unsolved. You should consider multi-attempt evaluation strategies, since repeated runs substantially reduce the number of unsolved questions and give a more robust picture of an LLM's problem-solving potential than single-shot performance.

Key insights

Given repeated attempts, the evaluated LLMs solved 98 of the 100 research-level problems; single attempts by five models left 41 unsolved.

Principles

Repeated attempts substantially raise the share of math problems an LLM solves at least once.
Dataset creation benefits from focused workshops.

Method

A three-stage LLM evaluation process: single attempts by five models, followed by 20 runs per model for three models, then 3 runs for two "heavy-thinking" models.
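
The staged process can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `attempt` callback, the placeholder model names, and the choice to carry only still-unsolved questions into the next stage are all assumptions, since the summary does not describe how runs were graded or scheduled.

```python
from typing import Callable, Iterable

# A stage is (model names, runs per model). The names are placeholders;
# the summary does not identify the models.
Stage = tuple[list[str], int]

LEIPZIG_STAGES: list[Stage] = [
    (["model_1", "model_2", "model_3", "model_4", "model_5"], 1),  # single attempts
    (["model_1", "model_2", "model_3"], 20),                       # 20 runs per model
    (["heavy_1", "heavy_2"], 3),                                   # "heavy-thinking" models
]


def staged_unsolved(
    questions: Iterable[str],
    stages: list[Stage],
    attempt: Callable[[str, str, int], bool],
) -> list[int]:
    """Return how many questions remain unsolved after each stage.

    `attempt(model, question, run_index)` is a hypothetical grader that
    returns True when that run produced a correct answer.
    """
    unsolved = set(questions)
    history: list[int] = []
    for models, runs in stages:
        # A question counts as solved if any model succeeds on any run.
        solved_now = {
            q
            for q in unsolved
            if any(attempt(m, q, r) for m in models for r in range(runs))
        }
        # Assumption: only still-unsolved questions move to the next stage.
        unsolved -= solved_now
        history.append(len(unsolved))
    return history
```

With a real grader plugged in as `attempt`, the returned list would play the role of the reported sequence of 41, 16, and 2 unsolved questions, which corresponds to 59, 84, and 98 of the 100 questions solved.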

In practice

Develop research-level math datasets for LLM benchmarking.
Implement multi-attempt strategies for complex LLM tasks.
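
When reporting multi-attempt results, it helps to keep two numbers apart: the fraction of individual runs that succeed, and the fraction of questions solved at least once. The sketch below shows the difference on made-up outcomes; the function name and the input layout (a list of pass/fail results per question) are illustrative assumptions, not something taken from the original study.

```python
def summarize_runs(outcomes: dict[str, list[bool]]) -> dict[str, float]:
    """Compare per-run success with solved-at-least-once coverage.

    `outcomes` maps a question id to the pass/fail result of each run.
    """
    total_runs = sum(len(runs) for runs in outcomes.values())
    if not outcomes or total_runs == 0:
        raise ValueError("need at least one question with at least one run")

    # Share of all individual runs that were correct.
    per_run_rate = sum(sum(runs) for runs in outcomes.values()) / total_runs
    # Share of questions with at least one correct run.
    solved_at_least_once = sum(any(runs) for runs in outcomes.values()) / len(outcomes)

    return {
        "per_run_rate": per_run_rate,
        "solved_at_least_once": solved_at_least_once,
    }
```

A question that succeeds on one run out of twenty counts fully toward the second number but only slightly toward the first, so the two can diverge widely on hard problems.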

Topics

LLM Evaluation
Mathematical Reasoning
Research Mathematics
Dataset Creation
Benchmarks in Leipzig
Max Planck Institute

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.