Riemann-Bench: A Benchmark for Moonshot Mathematics
Summary
Riemann-Bench is a new, private benchmark of 25 expert-curated problems designed to evaluate AI systems on PhD-level research mathematics rather than competition-style problem-solving. Benchmarks such as GSM8K and MATH cover grade-school and competition-level math, and frontier models now reach gold-medal performance at the olympiad level (e.g., Gemini with Deep Think scored 35/42 on IMO 2025, and DeepSeekMath-V2 scored 118/120 on Putnam 2024). Riemann-Bench problems, by contrast, are authored by Ivy League professors and PhD-holding IMO medalists and often take weeks to solve. Each problem has a unique, closed-form solution that is verified by two independent domain experts and assessed programmatically. Models are evaluated as unconstrained research agents with full access to coding tools and search, with 100 independent attempts per problem. Current frontier models score below 10% on Riemann-Bench, which points to a significant gap between competition-level and genuine research-level mathematical reasoning.
Key takeaway
If you are an AI scientist or machine learning engineer building advanced reasoning models, note that current systems excel at olympiad-level math yet still score below 10% on research-level problems. Rather than aiming for fully autonomous mathematical reasoning, focus on AI-assisted tools that support human mathematicians in specific subtasks, where human verification can catch fabricated solutions and inapplicable theoretical frameworks.
Key insights
Research-level mathematics remains largely beyond current AI capabilities, despite strong performance on competition problems.
Principles
- Research math requires deep theoretical knowledge.
- Competition math often rewards insightful tricks.
- Private benchmarks prevent data contamination.
Method
Riemann-Bench comprises 25 expert-curated, PhD-level problems. Each solution is verified double-blind by two independent domain experts and assessed programmatically. Models are evaluated as unconstrained agents, with 100 independent attempts per problem.
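To make the scoring setup concrete, here is a minimal sketch of how repeated attempts on closed-form answers could be checked and aggregated. It is an illustration under stated assumptions, not the benchmark's actual harness: this summary does not describe the real checker or how per-problem results are combined, so the function names, the exact-rational comparison, and the mean-of-per-problem-accuracy aggregation are all hypothetical.

```python
from fractions import Fraction


def is_correct(submitted: str, reference: str) -> bool:
    """Hypothetical checker: treat both answers as exact rationals and compare.

    A real checker would need to handle far richer closed forms;
    rationals are used here only to keep the sketch self-contained.
    """
    try:
        return Fraction(submitted.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable or ill-formed answers count as incorrect.
        return False


def problem_accuracy(attempts: list[str], reference: str) -> float:
    """Fraction of independent attempts that match the reference answer."""
    if not attempts:
        raise ValueError("at least one attempt is required")
    return sum(is_correct(a, reference) for a in attempts) / len(attempts)


def benchmark_score(results: dict[str, tuple[list[str], str]]) -> float:
    """Assumed aggregation: unweighted mean of per-problem accuracies.

    `results` maps a problem id to (list of attempt answers, reference answer).
    """
    per_problem = [problem_accuracy(a, ref) for a, ref in results.values()]
    return sum(per_problem) / len(per_problem)


if __name__ == "__main__":
    # Toy data with 4 attempts per problem; the benchmark uses 100.
    toy = {
        "p1": (["1/2", "0.5", "2/3", "oops"], "1/2"),
        "p2": (["7", "8", "8", "8"], "7"),
    }
    print(f"score: {benchmark_score(toy):.3f}")
```

Averaging over many attempts per problem gives a success rate rather than a single pass/fail outcome, which makes the result less sensitive to one lucky or unlucky run.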
In practice
- AI models can fabricate theorems when challenged.
- AI is better suited for subtasks in research.
- Focus on AI-assisted research, not autonomy.
Topics
- Riemann-Bench
- Research Mathematics
- AI Mathematical Reasoning
- Benchmark Evaluation
- Olympiad Mathematics
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.