Benchmarks in Leipzig

2026-04-23 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, AI in Mathematics · Depth: Expert, extended

Summary

The "Benchmarks in Leipzig" project ran from April 1 to May 15, 2026. In it, 49 mathematicians compiled a dataset of 100 research-level mathematics questions with known answers. The effort culminated in a 3-day workshop (May 11–13, 2026) with 35 participants at the Max Planck Institute in Leipzig. The questions were then used to evaluate current Large Language Models (LLMs) in three stages. In the first stage, 5 LLMs made a single attempt each, leaving 41 questions unsolved. In the second, three LLMs were run 20 times each, reducing the unsolved count to 16. In the third, two "heavy-thinking" models, GPT-5.5 Pro and Gemini 3.1 Pro Deep Think, were run three times each, leaving only 2 questions completely unsolved. By the end of the evaluation, at least one model run had therefore solved 98 of the 100 questions.

Key takeaway

Research scientists evaluating LLMs for advanced mathematical reasoning should adopt multi-stage, multi-run evaluation protocols to gauge both capability and consistency, because model performance can vary significantly across runs. The "Benchmarks in Leipzig" results show that even research-level problems are becoming solvable, which suggests trying AI chat tools in your own research, both for problem-solving and as potential peer review support.

Key insights

LLMs show strong and improving mathematical reasoning: across the three evaluation stages, all but 2 of the 100 research-level questions in the new benchmark were solved in at least one run.

Principles

LLM performance can vary significantly from run to run.
Multi-run evaluations reveal model consistency.
AI can aid mathematical peer review.

Method

A three-stage evaluation with progressively more advanced configurations: single-run attempts by 5 LLMs, then 20 runs per model for three LLMs, then three runs each for two "heavy-thinking" models. Questions were curated by mathematicians on the ScienceBench platform.
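
The staged bookkeeping can be sketched roughly as follows. This is an illustration only, not the project's actual tooling: the function name `remaining_unsolved`, the data layout, and the toy data are all hypothetical. It also assumes that a question counts as solved once any run in any stage so far answers it correctly, which the summary implies but does not state outright.

```python
from typing import Dict, Iterable, List, Set


def remaining_unsolved(
    question_ids: Iterable[str],
    stages: List[Dict[str, List[bool]]],
) -> List[int]:
    """Return how many questions are still unsolved after each stage.

    stages[i] maps a question id to the correctness flags of every run
    (pooled over all models) made on that question in stage i.
    Assumption: one correct run anywhere marks the question as solved.
    """
    unsolved: Set[str] = set(question_ids)
    counts: List[int] = []
    for stage in stages:
        # A question is solved in this stage if at least one run is correct.
        solved_now = {q for q, runs in stage.items() if any(runs)}
        unsolved -= solved_now
        counts.append(len(unsolved))
    return counts


# Toy data (hypothetical, NOT the benchmark's results).
questions = ["q1", "q2", "q3", "q4"]
stage1 = {"q1": [True], "q2": [False], "q3": [False], "q4": [False]}
stage2 = {"q2": [False, True, False], "q3": [False] * 3, "q4": [False] * 3}
stage3 = {"q3": [True, False], "q4": [False, False]}

toy_counts = remaining_unsolved(questions, [stage1, stage2, stage3])
print(toy_counts)  # [3, 2, 1]
```

Under the same assumption, the reported sequence of 41, 16, and 2 unsolved questions out of 100 would correspond to the three entries this function returns for the real run logs.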

In practice

Use multi-run evaluations for robust LLM assessment.
Explore AI chat tools for mathematical research.
Consider AI for peer review of math content.
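
One simple way to act on the multi-run advice is to record a per-question solve rate across repeated runs. The sketch below is a minimal illustration under stated assumptions: the helper names `solve_rates` and `classify`, the three labels, and the toy data are hypothetical and do not come from the original paper.

```python
from typing import Dict, List


def solve_rates(runs: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of correct runs per question (questions with no runs are skipped)."""
    return {q: sum(flags) / len(flags) for q, flags in runs.items() if flags}


def classify(rate: float) -> str:
    """Coarse consistency label for a single question's solve rate."""
    if rate == 0.0:
        return "never solved"
    if rate == 1.0:
        return "always solved"
    return "sometimes solved"


# Toy data (hypothetical): four runs of one model on three questions.
toy_runs = {
    "q1": [True, True, True, True],
    "q2": [True, False, False, False],
    "q3": [False, False, False, False],
}

rates = solve_rates(toy_runs)
labels = {q: classify(r) for q, r in rates.items()}
print(rates)   # {'q1': 1.0, 'q2': 0.25, 'q3': 0.0}
print(labels)
```

A single attempt on a question like the hypothetical `q2` would most likely report a failure even though the model can solve it, which is why single-run results understate capability and say nothing about consistency.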

Topics

Large Language Models
Mathematical Reasoning
Benchmarking
ScienceBench Platform
AI-assisted Peer Review
Research Mathematics

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.