Benchmarks in Leipzig
Summary
The "Benchmarks in Leipzig" project ran from April 1 to May 15, 2026. In it, 49 mathematicians compiled a dataset of 100 research-level mathematics questions with known answers. The effort culminated in a 3-day workshop (May 11–13, 2026) with 35 participants at the Max Planck Institute in Leipzig. The questions were then used to evaluate current Large Language Models (LLMs) in three stages. In the first stage, 5 LLMs each made a single attempt, leaving 41 questions unsolved. In the second stage, three LLMs were run 20 times each, reducing the unsolved count to 16. In the third stage, two "heavy-thinking" models, GPT-5.5 Pro and Gemini 3.1 Pro Deep Think, were given three runs each, after which only 2 questions remained completely unsolved. In other words, 98 of the 100 questions were answered correctly at least once across the three stages, which points to strong and improving mathematical reasoning in current LLMs.
Key takeaway
If you evaluate LLMs for advanced mathematical reasoning, adopt multi-stage, multi-run evaluation protocols so that you measure both capability and consistency, because model performance can vary significantly across runs. The "Benchmarks in Leipzig" project shows that even research-level problems are becoming solvable, which suggests that AI chat tools are worth trying in your research, both for problem-solving and as support for peer review.
Key insights
Across three evaluation stages, current LLMs solved all but 2 of the 100 research-level math questions in a new benchmark, evidence of strong and improving mathematical reasoning.
Principles
- LLM performance can vary significantly from run to run.
- Multi-run evaluations reveal model consistency.
- AI can aid mathematical peer review.
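To make the consistency principle concrete, here is a minimal Python sketch that turns repeated runs into a per-question solve rate and then counts how many questions a model solves always, sometimes, or never. The function names (`solve_rates`, `summarize_consistency`) and the `(question_id, correct)` input format are assumptions made for illustration; they are not taken from the project or from ScienceBench.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def solve_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of correct runs per question.

    `results` holds one (question_id, correct) pair per run (hypothetical format).
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for question_id, correct in results:
        totals[question_id] += 1
        hits[question_id] += int(correct)
    return {q: hits[q] / totals[q] for q in totals}


def summarize_consistency(rates: Dict[str, float]) -> Dict[str, int]:
    """Count questions solved in every run, in some runs, or in no run."""
    always = sum(1 for r in rates.values() if r == 1.0)
    never = sum(1 for r in rates.values() if r == 0.0)
    return {
        "always": always,
        "sometimes": len(rates) - always - never,
        "never": never,
    }


if __name__ == "__main__":
    # Toy data: three questions, two runs each.
    runs = [
        ("q1", True), ("q1", True),
        ("q2", True), ("q2", False),
        ("q3", False), ("q3", False),
    ]
    print(summarize_consistency(solve_rates(runs)))
```

A single run would report "q2" as either solved or unsolved depending on luck; the "sometimes" bucket is what a multi-run evaluation adds.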
Method
A three-stage evaluation process: single attempts by 5 LLMs, then 20 runs per model for three LLMs, then three runs each for two "heavy-thinking" models (GPT-5.5 Pro and Gemini 3.1 Pro Deep Think). Questions were curated by mathematicians on the ScienceBench platform.
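The staged protocol can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual pipeline: it assumes a question counts as solved once any run of any model in a stage answers it correctly, and that each later stage re-attempts only the questions still unsolved (the summary does not say whether the project did this). The `attempt` callback, `run_stage`, and `staged_evaluation` are hypothetical names.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

# Hypothetical callback: attempt(model, question_id) returns True when the
# model's answer matches the known answer for that question.
Attempt = Callable[[str, str], bool]


def run_stage(unsolved: Set[str], models: Sequence[str], runs: int,
              attempt: Attempt) -> Set[str]:
    """Return the questions still unsolved after every model had `runs` tries."""
    solved = set()
    for question_id in unsolved:
        for model in models:
            # any() stops at the first successful run for this model.
            if any(attempt(model, question_id) for _ in range(runs)):
                solved.add(question_id)
                break
    return unsolved - solved


def staged_evaluation(questions: Iterable[str],
                      stages: Sequence[Tuple[Sequence[str], int]],
                      attempt: Attempt) -> List[int]:
    """Run the stages in order and return the unsolved count after each one."""
    unsolved = set(questions)
    history = []
    for models, runs in stages:
        unsolved = run_stage(unsolved, models, runs, attempt)
        history.append(len(unsolved))
    return history


if __name__ == "__main__":
    # Toy stand-in for real model calls: each model solves a fixed prefix.
    limits = {"base": 4, "repeat": 7, "heavy": 9}

    def toy_attempt(model: str, question_id: str) -> bool:
        return int(question_id[1:]) < limits[model]

    questions = [f"q{i}" for i in range(10)]
    stages = [(["base"], 1), (["repeat"], 20), (["heavy"], 3)]
    print(staged_evaluation(questions, stages, toy_attempt))
```

With real models, `attempt` would call the model and compare its output to the known answer; the returned list plays the role of the 41, 16, 2 sequence reported for the benchmark.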
In practice
- Use multi-run evaluations for robust LLM assessment.
- Explore AI chat tools for mathematical research.
- Consider AI for peer review of math content.
Topics
- Large Language Models
- Mathematical Reasoning
- Benchmarking
- ScienceBench Platform
- AI-assisted Peer Review
- Research Mathematics
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.