Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Summary
Researchers at Google DeepMind and Imperial College London introduce "Formal Conjectures," an open-source, evolving benchmark of 2,615 mathematical problem statements formalized in Lean 4. The dataset, available on GitHub, addresses two limitations of existing automated reasoning benchmarks, data leakage and saturation, by focusing on 1,029 open research conjectures, which provide a zero-contamination testbed for proof discovery. It also includes 836 solved problems for proof auto-formalization. The benchmark features a structured interface for mathematicians and AI systems, a collaborative methodology for ensuring formalization correctness, and a standardized evaluation setup with frozen subsets such as FC100SolvedSet1 and FC100OpenSet1. In initial evaluations on FC100SolvedSet1, AlphaProof achieves a 45-50% solve rate and a DeepMind prover agent reaches 66%, which illustrates the benchmark's usefulness for measuring advances in automated reasoning.
Key takeaway
For AI Scientists and Machine Learning Engineers developing automated reasoning systems, Formal Conjectures provides a robust, evolving benchmark to validate your models' capabilities on research-level mathematics. Focus on developing systems that can tackle the zero-contamination open conjectures to demonstrate true mathematical discovery, and utilize the auto-formalization track to refine your models' ability to translate informal math into Lean 4. Engage with the community and contribute to the benchmark to accelerate the frontier of formal mathematical research.
Key insights
Formal Conjectures offers a dynamic, open-source benchmark for evaluating AI in advanced mathematical proof discovery and auto-formalization.
Principles
- Zero-contamination testbeds are crucial for evaluating genuine AI reasoning.
- Formalization clarifies mathematical statements and reveals gaps in libraries.
- Iterative auditing by AI improves formalization fidelity.
Method
Problems are formalized in Lean 4, categorized (e.g., research open, solved), and reviewed by both humans and AI. The `answer(sorry)` mechanism separates answer discovery from proof verification, and `FormalConjecturesForMathlib` stages new definitions intended for upstreaming.
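As a rough illustration of the method, a benchmark entry might look like the sketch below. It assumes the repository's `FormalConjectures.Util.ProblemImports` import, its `category` and `AMS` attributes, and its `answer(sorry)` placeholder. The choice of problem (twin primes) and the theorem name are illustrative and are not taken from the summary above.

```lean
-- Sketch only: assumes the Formal Conjectures repository is available as a dependency.
import FormalConjectures.Util.ProblemImports

/-- Are there infinitely many primes `p` such that `p + 2` is also prime? -/
@[category research open, AMS 11]
theorem twin_primes_sketch :
    -- `answer(sorry)` marks the unknown truth value of the question.
    answer(sorry) ↔ {p : ℕ | p.Prime ∧ (p + 2).Prime}.Infinite := by
  -- The proof itself is also left open.
  sorry
```

A solver would first commit to a truth value in place of the `sorry` inside `answer(...)` and then supply a proof of the resulting statement, so the answer and its verification are checked as separate steps.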
In practice
- Use `answer(sorry)` for problems requiring a specific answer or truth value.
- Contribute auxiliary definitions to `FormalConjecturesForMathlib` for upstreaming.
- Employ AI tools for cross-checking formalizations against informal sources.
Topics
- Formal Conjectures Benchmark
- Lean 4 Formalization
- Automated Theorem Proving
- Mathematical Proof Discovery
- Proof Auto-formalization
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.