Benchmarks in Leipzig
Summary
Forty-nine mathematicians compiled a new dataset of 100 research-level mathematics questions between April 1 and May 15, 2026, primarily during the 3-day "Benchmarks in Leipzig" workshop at the Max Planck Institute for Mathematics in the Sciences. The collection was used to evaluate the mathematical reasoning of state-of-the-art Large Language Models (LLMs) in three stages. In the first stage, five LLMs attempted the questions once, leaving 41 unsolved. In the second stage, three of these LLMs were given 20 runs per model, which reduced the unsolved count to 16. In the third stage, two "heavy-thinking" models performed three runs each, leaving only 2 questions unsolved. The solved count thus rose from 59 to 84 to 98 out of 100, a progression that points to strong and still-developing mathematical reasoning in current LLMs.
Key takeaway
AI scientists and machine learning engineers evaluating LLMs on complex reasoning tasks should move beyond single-attempt tests. In this benchmark, 41 of 100 questions were unsolved after one attempt per model, but only 2 remained unsolved after repeated runs and "heavy-thinking" models. Consider multi-stage evaluations similar to the Leipzig benchmark, escalating from single attempts to repeated runs and then to "heavy-thinking" models on the hardest problems, to assess a model's full problem-solving potential.
Key insights
Iterative, multi-stage evaluation reveals mathematical reasoning capability that single attempts miss: the number of unsolved questions fell from 41 to 16 to 2 across the three stages.
Principles
- LLM mathematical reasoning is improving: all but 2 of the 100 research-level questions were eventually solved.
- Multi-stage evaluation uncovers capability that single-attempt tests miss.
Method
A three-stage evaluation: single attempts by five LLMs, then 20 runs per model with three of those LLMs, then 3 runs each with two "heavy-thinking" models.
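The staged protocol can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's actual pipeline: the `Solver` type, the function names, and the toy solvers are all hypothetical, and the sketch assumes that each later stage only retries the questions left unsolved by the previous one and that a question counts as solved if any run by any model is graded correct.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical interface: a solver takes a question ID and returns True
# when that run produced an answer graded as correct.
Solver = Callable[[str], bool]
Stage = Tuple[Dict[str, Solver], int]  # (models, runs per model)


def run_stage(unsolved: Set[str], models: Dict[str, Solver], runs_per_model: int) -> Set[str]:
    """Return the questions that no model solves in any run of this stage."""
    still_unsolved = set()
    for question in sorted(unsolved):
        solved = any(
            solver(question)
            for solver in models.values()
            for _ in range(runs_per_model)
        )
        if not solved:
            still_unsolved.add(question)
    return still_unsolved


def staged_evaluation(questions: Sequence[str], stages: Sequence[Stage]) -> List[int]:
    """Run the stages in order and record the unsolved count after each one."""
    unsolved = set(questions)
    history = []
    for models, runs_per_model in stages:
        # Assumption: later stages only retry what is still unsolved.
        unsolved = run_stage(unsolved, models, runs_per_model)
        history.append(len(unsolved))
    return history


def make_toy_solver(max_difficulty: int) -> Solver:
    """Toy stand-in for a model: solves question 'q<i>' when i < max_difficulty.

    Real model runs are stochastic, which is why repeated runs help;
    this deterministic stub only exercises the bookkeeping.
    """
    return lambda question: int(question[1:]) < max_difficulty


questions = [f"q{i}" for i in range(10)]
stages = [
    ({"model_a": make_toy_solver(3), "model_b": make_toy_solver(5)}, 1),  # single attempts
    ({"model_b": make_toy_solver(7)}, 20),                                # repeated runs
    ({"heavy_model": make_toy_solver(9)}, 3),                             # "heavy-thinking" runs
]
history = staged_evaluation(questions, stages)
print(history)
```

With real models, the recorded history plays the role of the 41, 16, 2 sequence reported for the benchmark. Restricting later stages to the remaining questions keeps the cost of the expensive runs proportional to the hard tail of the dataset.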
In practice
- Develop research-level math benchmarks.
- Employ multi-stage LLM evaluation protocols.
Topics
- LLM Benchmarking
- Mathematical Reasoning
- Research Datasets
- Model Evaluation Protocols
- Large Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.