HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification
Summary
HorizonMath is a new benchmark of over 100 predominantly unsolved problems across eight domains in computational and applied mathematics, designed to measure AI progress toward mathematical discovery. It ships with an open-source evaluation framework for automated verification and targets problems where finding a solution demands significant mathematical insight but checking one is computationally cheap. Because the solutions are unknown, they cannot have leaked into training data, which makes the benchmark robust against data contamination. Current state-of-the-art models typically score near 0%. Unlike existing research-level benchmarks, which rely on expensive formal proof verification or manual review, HorizonMath verifies answers automatically and at scale. In initial evaluations, GPT 5.4 Pro proposed solutions to two problems that may improve on the best-known published results; these are potential novel contributions and still await expert review. HorizonMath is released as an open challenge and a community resource.
Key takeaway
For AI researchers focused on advancing mathematical reasoning, HorizonMath offers a contamination-resistant benchmark that tests models on unsolved problems. Consider integrating it into your evaluation pipeline to probe genuine discovery capabilities: GPT 5.4 Pro has already proposed two candidate improvements on this platform, both pending expert review.
Key insights
HorizonMath benchmarks AI on predominantly unsolved math problems, using automated, scalable verification.
Principles
- Discovery is hard, verification can be simple.
- Unknown solutions prevent data contamination.
Method
HorizonMath evaluates AI on predominantly unsolved math problems using an open-source framework for automated, computationally efficient verification, avoiding the expensive formal proof verification or manual review that other research-level benchmarks rely on.
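The asymmetry between hard discovery and cheap verification can be illustrated with a toy sketch. The example below is not taken from the HorizonMath framework and does not use its API. The Golomb ruler problem, the function names `is_golomb_ruler` and `verify_candidate`, and the `best_known_length` parameter are all assumptions chosen for illustration. A Golomb ruler is a set of integer marks whose pairwise differences are all distinct; finding short rulers is hard, but checking a proposed one takes a few lines.

```python
from itertools import combinations


def is_golomb_ruler(marks):
    """Return True if all pairwise differences between marks are distinct."""
    marks = sorted(marks)
    # Duplicate marks would produce a zero difference, so reject them early.
    if len(set(marks)) != len(marks):
        return False
    diffs = [b - a for a, b in combinations(marks, 2)]
    return len(diffs) == len(set(diffs))


def verify_candidate(marks, order, best_known_length):
    """Check a hypothetical submission against a hypothetical best-known bound.

    A candidate counts as an improvement only if it is a valid ruler with
    exactly `order` marks and is strictly shorter than `best_known_length`.
    """
    marks = sorted(marks)
    valid = len(marks) == order and is_golomb_ruler(marks)
    length = marks[-1] - marks[0] if marks else 0
    return {
        "valid": valid,
        "length": length,
        "improves": valid and length < best_known_length,
    }
```

For instance, `verify_candidate([0, 1, 4, 6], order=4, best_known_length=6)` reports a valid ruler of length 6 that matches the bound without improving on it. The check runs in time quadratic in the number of marks, while the search for a shorter ruler becomes far more expensive as the order grows.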
In practice
- Use HorizonMath to evaluate models on open mathematical problems.
- Submit candidate solutions to the open challenge.
Topics
- Mathematical Discovery
- AI Benchmarking
- Large Language Models
- Automated Verification
- Computational Mathematics
Best for: AI Researcher, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.