Riemann-Bench: A Benchmark for Moonshot Mathematics

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Emerging Technologies & Innovation) · Depth: Expert, extended

Summary

Riemann-Bench is a new, private benchmark of 25 expert-curated problems designed to evaluate AI systems on PhD-level research mathematics rather than competition-style problem-solving. Benchmarks such as GSM8K and MATH focus on grade-school and competition-style math, a regime where frontier models now post gold-medal-level scores (e.g., Gemini with Deep Think scored 35/42 on IMO 2025, and DeepSeekMath-V2 scored 118/120 on Putnam 2024). Riemann-Bench problems, by contrast, are authored by Ivy League professors and PhD-holding IMO medalists and often take weeks to solve. Each problem has a unique, closed-form solution that is verified by two independent domain experts and assessed programmatically. Models are evaluated as unconstrained AI research agents with full access to coding tools and search, with 100 independent attempts per problem. Current frontier models score below 10% on Riemann-Bench, which points to a significant gap between competition-level and genuine research-level mathematical reasoning.

Key takeaway

If you are an AI scientist or machine learning engineer developing advanced reasoning models, note that current systems excel at olympiad-level math yet still score below 10% on research-level problems. For now, shift your focus from fully autonomous mathematical reasoning to AI-assisted tools that support human mathematicians on specific subtasks, where human verification can catch fabricated solutions and misapplied theoretical frameworks.

Key insights

Research-level mathematics remains largely beyond current AI capabilities, despite strong performance on competition problems.

Method

Riemann-Bench uses 25 expert-curated, PhD-level problems, each double-blind verified by two independent experts, with solutions assessed programmatically. Models are evaluated as unconstrained agents, with 100 independent runs per problem.
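
To make the protocol concrete, here is a minimal Python sketch of how repeated attempts on closed-form problems could be scored programmatically. It is an illustration under stated assumptions, not the benchmark's actual checker: the function names, the answer normalization, and the choice to average per-problem pass rates are all hypothetical, since this summary does not specify the exact metric.

```python
from fractions import Fraction


def normalize(answer: str) -> str:
    """Canonicalize an answer string: drop all whitespace and lowercase it."""
    return "".join(answer.split()).lower()


def is_correct(submitted: str, reference: str) -> bool:
    """Compare two closed-form answers.

    Assumption: rational answers are compared exactly via Fraction;
    anything else falls back to a normalized string match. A real
    checker would likely need symbolic equivalence testing.
    """
    try:
        return Fraction(normalize(submitted)) == Fraction(normalize(reference))
    except (ValueError, ZeroDivisionError):
        return normalize(submitted) == normalize(reference)


def pass_rate(attempts: list[str], reference: str) -> float:
    """Fraction of independent attempts that match the reference answer."""
    return sum(is_correct(a, reference) for a in attempts) / len(attempts)


def benchmark_score(results: dict[str, tuple[str, list[str]]]) -> float:
    """Mean per-problem pass rate (hypothetical aggregation).

    `results` maps a problem id to (reference answer, list of attempts).
    """
    rates = [pass_rate(attempts, ref) for ref, attempts in results.values()]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Made-up data: 2 problems with 100 attempts each, purely for illustration.
    demo = {
        "p1": ("1/2", ["0.5"] * 7 + ["1/3"] * 93),
        "p2": ("2*pi", ["pi"] * 100),
    }
    print(f"score: {benchmark_score(demo):.3f}")
```

Averaging per-problem pass rates is only one possible convention; a pass@k-style metric (a problem counts as solved if any of the 100 attempts is correct) would yield a different headline number from the same runs.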

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.