Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Software Development & Engineering) · Depth: Expert, extended

Summary

Researchers at Google DeepMind and Imperial College London introduce "Formal Conjectures," an open-source, evolving benchmark of 2,615 mathematical problem statements formalized in Lean 4 and hosted on GitHub. Existing automated reasoning benchmarks suffer from data leakage and saturation; Formal Conjectures addresses both by centering on 1,029 open research conjectures, which have no known proofs and therefore provide a zero-contamination testbed for proof discovery. A further 836 solved problems support a proof auto-formalization track. The benchmark provides a structured interface for mathematicians and AI systems, a collaborative methodology for checking that formalizations are correct, and a standardized evaluation setup built on frozen subsets such as FC100SolvedSet1 and FC100OpenSet1. In initial evaluations on FC100SolvedSet1, AlphaProof achieves a 45-50% solve rate and a DeepMind prover agent reaches 66%, which suggests the benchmark can measure progress in automated reasoning.

Key takeaway

For AI scientists and machine learning engineers building automated reasoning systems, Formal Conjectures is an evolving benchmark for testing models on research-level mathematics. Target the zero-contamination open conjectures to demonstrate genuine proof discovery rather than recall, and use the auto-formalization track to improve your models' ability to translate informal mathematics into Lean 4. Engage with the community and contribute to the benchmark to help advance formal mathematical research.

Key insights

Formal Conjectures offers a dynamic, open-source benchmark for evaluating AI in advanced mathematical proof discovery and auto-formalization.

Method

Problems are formalized in Lean 4, tagged by category (e.g., research open, solved), and reviewed by both humans and AI. The `answer(sorry)` mechanism separates answer discovery from proof verification: in a question-style problem, the answer is a placeholder distinct from the proof. `FormalConjecturesForMathlib` stages the new definitions that problem statements require.
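
To illustrate how an answer placeholder keeps answer discovery apart from proof verification, here is a minimal sketch of what a question-style entry might look like. It assumes that the elaborator is written `answer(sorry)`, that entries import `FormalConjectures.Util.ProblemImports`, and that categories are attached with a `@[category ...]` attribute, as in the public repository at the time of writing. The theorem name and docstring are hypothetical, and the statement is chosen only for illustration.

```lean
-- Assumed import path from the Formal Conjectures repository; it may differ between versions.
import FormalConjectures.Util.ProblemImports

/-- Illustrative entry (hypothetical name): are there infinitely many twin primes?
The yes/no answer is left open via `answer(sorry)`; the proof is a separate `sorry`. -/
@[category research open, AMS 11]
theorem illustrative_twin_primes :
    (answer(sorry) : Prop) ↔ {p : ℕ | p.Prime ∧ (p + 2).Prime}.Infinite := by
  sorry
```

The intent is that a system first proposes a concrete value for the placeholder (here `True` or `False`) and then proves the resulting statement, so that finding the answer and verifying the proof are separate steps.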

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.