Benchmarks in Leipzig
Summary
A group of 49 mathematicians, collaborating between April 1 and May 15, 2026, primarily during the "Benchmarks in Leipzig" workshop at the Max Planck Institute, compiled a dataset of 100 research-level mathematics questions. These questions were used to evaluate state-of-the-art LLMs in three stages. In the first stage, five LLMs attempted each question once, leaving 41 questions unsolved. In the second stage, three LLMs were run 20 times each, reducing the unsolved count to 16. In the third stage, two "heavy-thinking" models performed 3 runs, leaving only 2 questions unsolved. Across all stages, 98 of the 100 questions were solved at least once, which points to strong and improving mathematical reasoning in current LLMs.
Key takeaway
For research scientists evaluating LLM capabilities in complex mathematical reasoning, this study indicates that current models, taken together and given repeated attempts, can solve nearly all of a set of research-level problems: 98 of 100 here. Consider implementing multi-attempt evaluation strategies, since the unsolved count fell from 41 after single attempts to 16 and then 2 after the repeated-run stages. This approach gives a more robust picture of an LLM's problem-solving potential than single-shot performance alone.
Key insights
LLMs exhibit impressive mathematical reasoning: given repeated attempts across several models, they solved 98 of 100 research-level problems.
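The arithmetic behind this insight can be checked directly from the unsolved counts reported in the summary. The short sketch below assumes that "unsolved" means no model solved the question in any run up to that stage; the stage labels and the helper name `solved_progress` are illustrative, not taken from the original article.

```python
# Unsolved-question counts as reported in the summary, out of 100 questions.
TOTAL_QUESTIONS = 100
unsolved_after_stage = {
    "stage 1 (5 models, 1 attempt each)": 41,
    "stage 2 (3 models, 20 runs each)": 16,
    "stage 3 (2 heavy-thinking models, 3 runs)": 2,
}


def solved_progress(total, unsolved_by_stage):
    """Return (stage, cumulative solved, newly solved) for each stage in order."""
    rows = []
    previous_unsolved = total
    for stage, unsolved in unsolved_by_stage.items():
        cumulative_solved = total - unsolved
        newly_solved = previous_unsolved - unsolved
        rows.append((stage, cumulative_solved, newly_solved))
        previous_unsolved = unsolved
    return rows


progress = solved_progress(TOTAL_QUESTIONS, unsolved_after_stage)
for stage, solved, newly in progress:
    print(f"{stage}: {solved}/{TOTAL_QUESTIONS} solved (+{newly})")
```

Read this way, the cumulative solved counts are 59, 84, and 98, so the later stages contributed 25 and 14 additional solved questions respectively.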
Principles
- Iterative attempts significantly improve LLM math performance.
- Dataset creation benefits from focused workshops.
Method
A three-stage LLM evaluation process: a single attempt per question by each of five models, then 20 runs per model for three models, then 3 runs for two "heavy-thinking" models.
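A staged evaluation of this kind might be organized roughly as follows. This is a minimal sketch under stated assumptions, not the study's actual harness: it assumes each later stage is run only on questions still unsolved, that a question counts as solved if any run by any model is judged correct, and it uses mock solvers with made-up success probabilities in place of real model calls and grading. The names `run_stage`, `make_mock_solver`, and the model labels are hypothetical.

```python
import random
from typing import Callable, Dict, List, Set

# Hypothetical interface: a solver takes a question id and returns True
# when that single run's answer is judged correct.
Solver = Callable[[int], bool]


def run_stage(unsolved: Set[int], solvers: Dict[str, Solver], runs_per_model: int) -> Set[int]:
    """Run every solver `runs_per_model` times on each unsolved question.

    Returns the questions that no solver answered correctly in any run.
    """
    still_unsolved = set()
    for question in sorted(unsolved):
        solved = any(
            solver(question)
            for solver in solvers.values()
            for _ in range(runs_per_model)
        )
        if not solved:
            still_unsolved.add(question)
    return still_unsolved


def make_mock_solver(success_prob: float, rng: random.Random) -> Solver:
    """Stand-in for a real model call plus grading (assumption: illustrative only)."""
    return lambda question: rng.random() < success_prob


rng = random.Random(0)
questions = set(range(100))

# (solvers, runs per model) for each stage; the probabilities are invented.
stages = [
    ({f"model_{i}": make_mock_solver(0.15, rng) for i in range(5)}, 1),
    ({f"model_{i}": make_mock_solver(0.02, rng) for i in range(3)}, 20),
    ({f"heavy_{i}": make_mock_solver(0.30, rng) for i in range(2)}, 3),
]

history: List[Set[int]] = []
remaining = questions
for solvers, runs in stages:
    remaining = run_stage(remaining, solvers, runs)
    history.append(remaining)
    print(f"{len(solvers)} models x {runs} runs: {len(remaining)} unsolved")
```

Passing only the unsolved set forward keeps the cost of the expensive later stages proportional to the hard questions; whether the original study did this is not stated in the summary.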
In practice
- Develop research-level math datasets for LLM benchmarking.
- Implement multi-attempt strategies for complex LLM tasks.
Topics
- LLM Evaluation
- Mathematical Reasoning
- Research Mathematics
- Dataset Creation
- Benchmarks in Leipzig
- Max Planck Institute
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.