Benchmarks in Leipzig

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, AI for Mathematical Reasoning · Depth: Expert, quick

Summary

A group of 49 mathematicians, collaborating between April 1 and May 15, 2026, primarily during the "Benchmarks in Leipzig" workshop at the Max Planck Institute, compiled a dataset of 100 research-level mathematics questions. The questions were then used to evaluate state-of-the-art LLMs in three stages. In the first stage, five LLMs attempted each question once, leaving 41 questions unsolved. In the second stage, three LLMs were given 20 runs per model, reducing the number of unsolved questions to 16. In the third stage, two "heavy-thinking" models performed 3 runs, leaving only 2 questions unsolved. Across all three stages, 98 of the 100 questions were solved at least once, which indicates strong mathematical reasoning by current LLMs when repeated attempts are allowed.
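The stage-by-stage arithmetic can be made explicit with a short Python sketch. It uses only the counts reported in the summary (100 questions; 41, 16, and 2 unsolved after each stage). The function name `cumulative_solved` and the stage labels are illustrative choices, not taken from the original article.

```python
# Counts taken from the summary; names and labels are illustrative.
TOTAL_QUESTIONS = 100

# Number of questions still unsolved after each stage.
UNSOLVED_AFTER_STAGE = {
    "stage 1 (5 models, 1 attempt each)": 41,
    "stage 2 (3 models, 20 runs per model)": 16,
    "stage 3 (2 heavy-thinking models, 3 runs)": 2,
}


def cumulative_solved(total, unsolved_by_stage):
    """Return (stage, solved so far, newly solved in this stage) per stage."""
    rows = []
    previous_unsolved = total
    for stage, unsolved in unsolved_by_stage.items():
        solved_so_far = total - unsolved
        newly_solved = previous_unsolved - unsolved
        rows.append((stage, solved_so_far, newly_solved))
        previous_unsolved = unsolved
    return rows


for stage, solved, newly in cumulative_solved(TOTAL_QUESTIONS, UNSOLVED_AFTER_STAGE):
    print(f"{stage}: {solved}/{TOTAL_QUESTIONS} solved so far (+{newly} in this stage)")
```

Read this way, the first stage accounts for 59 solved questions, the second adds 25, and the third adds 14, for 98 in total.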

Key takeaway

For research scientists evaluating LLM capabilities in complex mathematical reasoning, this study indicates that current models, given repeated attempts, solved 98 of 100 research-level problems. Single-attempt performance was much lower: 41 questions remained unsolved after the first stage. You should consider implementing multi-attempt evaluation strategies, since repeated runs substantially reduced the number of unsolved questions. This approach gives a more robust picture of an LLM's problem-solving potential than single-shot performance, although a question solved once in many runs is not the same as a question solved reliably.
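One common way to report multi-attempt results is the pass@k metric: the probability that at least one of k sampled attempts is correct. The sketch below implements the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for a question with n recorded runs of which c were correct. The summary does not say that the study used this metric, so treat it as an assumption about how such an evaluation could be scored; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: total number of recorded runs for the question
    c: number of those runs that were correct
    k: number of attempts the metric allows (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k runs failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 20 runs on one question, exactly 1 correct.
print(pass_at_k(n=20, c=1, k=1))
print(pass_at_k(n=20, c=1, k=20))
```

With one correct run out of 20, the single-attempt estimate is low while the 20-attempt estimate is 1, which is the gap between single-shot and multi-attempt performance that the takeaway describes.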

Key insights

Given repeated attempts, LLMs solved 98 of the 100 research-level problems; after a single attempt per model, 41 remained unsolved.

Method

A three-stage LLM evaluation process: first, a single attempt per question by each of five LLMs; second, 20 runs per model for three LLMs; third, 3 runs by two "heavy-thinking" models. The number of unsolved questions is recorded after each stage.
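A staged evaluation of this kind can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's actual harness: the `attempt` callable (which would query a model and grade its answer), the `Stage` type, and the rule that later stages only retry questions still unsolved are all hypothetical.

```python
from typing import Callable, Hashable, Iterable, List, NamedTuple, Sequence


class Stage(NamedTuple):
    models: Sequence[str]  # model identifiers used in this stage (hypothetical)
    runs: int              # attempts per model per question


def run_stages(
    questions: Iterable[Hashable],
    stages: Sequence[Stage],
    attempt: Callable[[str, Hashable], bool],
) -> List[int]:
    """Return the number of unsolved questions after each stage.

    Assumption: a question counts as solved once any attempt succeeds,
    and later stages only retry questions that are still unsolved.
    """
    unsolved = set(questions)
    history = []
    for stage in stages:
        solved_now = {
            q
            for q in unsolved
            if any(
                attempt(model, q)
                for model in stage.models
                for _ in range(stage.runs)
            )
        }
        unsolved -= solved_now
        history.append(len(unsolved))
    return history


# Toy usage with a deterministic stand-in for a graded model call.
def toy_attempt(model: str, question: int) -> bool:
    limits = {"weak": 5, "strong": 8}  # hypothetical difficulty cut-offs
    return question < limits[model]


print(run_stages(range(10), [Stage(("weak",), 1), Stage(("strong",), 20)], toy_attempt))
```

In a real evaluation, `attempt` would be stochastic and would need a trusted grading step; the bookkeeping of unsolved counts per stage stays the same.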

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.