Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
Summary
Discover and Prove (DAP) is a new open-source agentic framework for "Hard Mode" automated theorem proving (ATP) in Lean 4. In Hard Mode, an AI system must discover the answer on its own before constructing a formal proof, whereas "Easy Mode" benchmarks embed the answer in the statement to be proved. The work also introduces two expert-reannotated Hard Mode benchmarks, MiniF2F-Hard and FIMO-Hard, derived from existing ATP datasets. The framework has two parts: a Discovery Module, which uses LLM natural-language reasoning with self-reflection to find the answer, and a Proving Module, which rewrites the problem into an Easy Mode statement and hands it to an existing ATP prover. DAP achieves state-of-the-art results, solving 10 problems on CombiBench (up from 7) and formally proving 36 theorems on PutnamBench in Hard Mode, a first for this setting. The study also reveals a significant gap: LLMs achieve over 80% answer accuracy where formal provers manage under 10% in Hard Mode.
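The Easy Mode versus Hard Mode distinction can be illustrated with a toy Lean 4 sketch. This is not taken from MiniF2F-Hard, FIMO-Hard, or any other benchmark; the arithmetic problem and every name in it (`easy_mode`, `hard_answer`, `hard_mode`, `discovered_answer`, `hard_mode_solved`) are hypothetical, and encoding the answer as a separate `sorry` definition is one assumed way of representing the hole.

```lean
-- Easy Mode: the answer (4) is already written into the statement.
theorem easy_mode : (2 : Nat) + 2 = 4 := by
  decide

-- Hard Mode: the answer is a hole that must be filled before proving.
-- (Hypothetical encoding: the hole is a separate definition left as `sorry`.)
abbrev hard_answer : Nat := sorry

theorem hard_mode : (2 : Nat) + 2 = hard_answer := by
  sorry

-- After discovery: filling the hole yields an ordinary Easy Mode statement
-- that a prover can attack directly.
abbrev discovered_answer : Nat := 4

theorem hard_mode_solved : (2 : Nat) + 2 = discovered_answer := by
  decide
```

In this toy case the answer is trivial; the point is only that the Hard Mode statement cannot be proved until the hole has a concrete value.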
Key takeaway
Research scientists developing AI for mathematical reasoning should prioritize systems that can discover answers independently before constructing formal proofs. The DAP framework demonstrates that combining natural-language reasoning with formal methods is effective for Hard Mode ATP, and it highlights the need to close the gap between LLM answer accuracy and formal prover capability. Consider integrating self-verification mechanisms for challenging problems to improve overall system reliability.
Key insights
Hard Mode ATP, which requires answer discovery before proof, reveals a significant gap between LLM reasoning and formal provers.
Principles
- Semantic alignment is crucial for realistic ATP benchmarks.
- Decoupling answer discovery from formal proving enhances performance.
- Self-verification improves accuracy on complex problems.
Method
The DAP framework first runs a Discovery Module, which generates an answer in natural language and self-corrects it. A Proving Module then converts the problem into an Easy Mode statement and passes it to Goedel-Prover-V2 for formal verification.
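The two-stage flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the framework's actual code: the function `discover_and_prove`, the `propose`, `reflect`, and `prove` callables, the `<ANSWER>` placeholder, and the `max_rounds` limit are all hypothetical names introduced here. In a real system the callables would wrap an LLM and a prover such as Goedel-Prover-V2.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Result:
    answer: Optional[str]          # answer accepted by self-reflection, if any
    easy_statement: Optional[str]  # statement with the answer filled in
    proved: bool                   # whether the prover closed the goal


def discover_and_prove(
    hard_statement: str,
    propose: Callable[[str, List[str]], str],       # hypothetical LLM call
    reflect: Callable[[str, str], Optional[str]],   # None means "accepted"
    prove: Callable[[str], bool],                   # hypothetical prover call
    placeholder: str = "<ANSWER>",                  # assumed hole marker
    max_rounds: int = 3,                            # assumed retry budget
) -> Result:
    feedback: List[str] = []
    for _ in range(max_rounds):
        # Discovery: propose an answer, then self-check it.
        answer = propose(hard_statement, feedback)
        critique = reflect(hard_statement, answer)
        if critique is not None:
            # Rejected: keep the critique so the next proposal can use it.
            feedback.append(critique)
            continue
        # Proving: fill the hole to obtain an Easy Mode statement.
        easy_statement = hard_statement.replace(placeholder, answer, 1)
        return Result(answer, easy_statement, prove(easy_statement))
    # No answer survived self-reflection within the budget.
    return Result(None, None, False)
```

Passing the modules in as plain callables keeps answer discovery and formal proving decoupled, so either side can be swapped without touching the other.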
In practice
- Utilize agentic LLMs for initial answer discovery.
- Re-annotate existing benchmarks for "Hard Mode" evaluation.
- Decouple reasoning and proving modules for flexibility.
Topics
- Automated Theorem Proving
- Agentic Frameworks
- Lean 4
- Hard Mode Benchmarks
- LLM-based Discovery
Code references
- liuchengwucn/discover-and-prove
- Huawei-xiaoyi/IMO2025-solutions
- jsm28/IMOLean
- mortarsanjaya/IMOSLLean4
- aw31/openai-imo-2025-proofs
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.