Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

2026-04-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Formal Reasoning & Automated Theorem Proving · Depth: Expert, extended

Summary

Discover and Prove (DAP) is a new open-source agentic framework for "Hard Mode" automated theorem proving (ATP) in Lean 4. In Hard Mode, the AI system must discover the answer on its own before constructing a formal proof, unlike "Easy Mode" benchmarks, where the answer is embedded in the statement to be proved. The work introduces two expert-reannotated Hard Mode benchmarks, MiniF2F-Hard and FIMO-Hard, derived from existing ATP datasets. The framework comprises a Discovery Module, which uses LLM natural-language reasoning with self-reflection to find the answer, and a Proving Module, which rewrites the problem into an Easy Mode statement and hands it to existing ATP provers. DAP achieves state-of-the-art results: it solves 10 problems on CombiBench (up from 7) and formally proves 36 theorems on PutnamBench in Hard Mode, a first for this setting. The study also reveals a significant gap: LLMs exceed 80% answer accuracy on problems where formal provers manage under 10% in Hard Mode.
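The difference between the two modes is easiest to see on a formal statement. The Lean 4 snippet below is a toy illustration, not a problem from MiniF2F-Hard, FIMO-Hard, CombiBench, or PutnamBench. The names toy_easy, toy_solution, and toy_hard are hypothetical, and the placeholder-definition style is an assumption modelled on how answer slots are commonly factored out of a theorem statement.

```lean
import Mathlib

-- Easy Mode: the answer (n = 2) is already written into the goal,
-- so the prover only has to verify it.
theorem toy_easy (n : ℕ) (h : 2 * n + 3 = 7) : n = 2 := by
  omega

-- Hard Mode: the answer is a placeholder that the system must fill in
-- before any proof attempt can succeed.
abbrev toy_solution : ℕ := sorry

theorem toy_hard (n : ℕ) (h : 2 * n + 3 = 7) : n = toy_solution := by
  sorry
```

Replacing the first sorry with 2 turns toy_hard into the same obligation as toy_easy, which is the kind of rewrite the Proving Module is described as performing once an answer has been found.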

Key takeaway

Research scientists developing AI for mathematical reasoning should prioritize systems that can discover an answer independently before constructing a formal proof. The DAP framework demonstrates that combining natural-language reasoning with formal methods is effective for Hard Mode ATP, and it highlights the need to close the gap between LLM answer accuracy and formal prover capability. Consider integrating self-verification mechanisms for challenging problems to improve overall system reliability.

Key insights

Hard Mode ATP, requiring answer discovery before proof, reveals a significant gap between LLM reasoning and formal provers.

Principles

Semantic alignment is crucial for realistic ATP benchmarks.
Decoupling answer discovery from formal proving enhances performance.
Self-verification improves accuracy on complex problems.

Method

The DAP framework uses a Discovery Module for natural-language answer generation and self-correction, followed by a Proving Module that rewrites the problem as an Easy Mode statement and passes it to Goedel-Prover-V2 for formal verification.
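The two-stage flow can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not DAP's actual code: propose, check, and prove are hypothetical callables standing in for the LLM, its self-reflection step, and a formal prover such as Goedel-Prover-V2, and the ANSWER_SLOT marker is an invented convention for where the answer goes in the Hard Mode statement.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical marker for the missing answer in a Hard Mode statement.
ANSWER_SLOT = "<ANSWER>"


@dataclass
class DapResult:
    answer: Optional[str]          # discovered answer, or None if discovery failed
    easy_statement: Optional[str]  # Hard Mode statement with the answer filled in
    proof: Optional[str]           # whatever the prover returned, or None


def discover(
    problem: str,
    propose: Callable[[str, str], str],
    check: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Discovery step: propose an answer, self-check it, retry with feedback."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(problem, feedback)
        accepted, feedback = check(problem, candidate)
        if accepted:
            return candidate
    return None  # no candidate survived self-verification


def discover_and_prove(
    problem: str,
    hard_statement: str,
    propose: Callable[[str, str], str],
    check: Callable[[str, str], Tuple[bool, str]],
    prove: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> DapResult:
    """Run discovery first, then hand an Easy Mode statement to the prover."""
    answer = discover(problem, propose, check, max_rounds)
    if answer is None:
        return DapResult(None, None, None)
    # Proving step: fill the answer slot to obtain an Easy Mode statement.
    easy_statement = hard_statement.replace(ANSWER_SLOT, answer)
    return DapResult(answer, easy_statement, prove(easy_statement))
```

Keeping the prover behind a plain callable reflects the decoupling described above: the discovery loop never needs to know which formal prover is used, so either side can be swapped independently.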

In practice

Utilize agentic LLMs for initial answer discovery.
Re-annotate existing benchmarks for "Hard Mode" evaluation.
Decouple reasoning and proving modules for flexibility.

Topics

Automated Theorem Proving
Agentic Frameworks
Lean 4
Hard Mode Benchmarks
LLM-based Discovery

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.