Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

2026-04-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

A new open-source agentic framework, Discover and Prove (DAP), has been released to address the "Hard Mode" challenge in automated theorem proving (ATP) in Lean 4. In "Easy Mode" benchmarks the answer is embedded in the formal statement; in Hard Mode the system must discover the answer itself before constructing a formal proof. To support this setting, the authors also released MiniF2F-Hard and FIMO-Hard, reannotated Hard Mode variants of existing ATP benchmarks. DAP uses large language model (LLM) natural-language reasoning with explicit self-reflection to discover answers, then rewrites each Hard Mode statement as an Easy Mode statement that existing ATP provers can attempt. DAP set a new state of the art on CombiBench, raising the number of solved problems from 7 (the previous Pass@16 SOTA) to 10, and is the first system to formally prove 36 theorems in Hard Mode on PutnamBench.
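To make the Easy Mode versus Hard Mode distinction concrete, here is a toy Lean 4 sketch. It is not taken from MiniF2F-Hard or FIMO-Hard; the theorem names and the placeholder encoding (an `abbrev` left as `sorry`) are assumptions chosen for illustration, and the benchmarks may encode the missing answer differently.

```lean
-- Easy Mode: the answer (x = 2) is already written into the statement,
-- so the prover only has to verify it.
theorem toy_easy (x : ℕ) (h : 2 * x + 3 = 7) : x = 2 := by
  omega

-- Hard Mode (illustrative encoding, an assumption): the answer is a
-- placeholder that the system must discover before any proof is possible.
abbrev toy_answer : ℕ := sorry

theorem toy_hard (x : ℕ) (h : 2 * x + 3 = 7) : x = toy_answer := by
  sorry
```

Once a system proposes `toy_answer := 2`, the Hard Mode statement reduces to the Easy Mode one, which is the reformulation step the summary describes.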

Key takeaway

AI scientists and machine learning engineers developing automated theorem provers should consider adopting Hard Mode benchmarks such as MiniF2F-Hard and FIMO-Hard. These benchmarks give a more realistic assessment of a model's capabilities, particularly in answer discovery, and help identify areas for improvement beyond proof generation alone.

Key insights

Hard Mode ATP requires independent answer discovery before formal proof, revealing a significant gap between LLM answer accuracy and formal prover capability.

Principles

Explicit self-reflection improves LLM reasoning.
Hard Mode benchmarks reveal true model capabilities.

Method

DAP uses LLM natural-language reasoning and self-reflection to discover answers, then rewrites Hard Mode problems into Easy Mode for existing ATP provers.
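The flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DAP's actual implementation: the function `discover_and_prove`, the `ANSWER` placeholder token, the callable signatures, and the retry limit are all hypothetical, and the LLM and prover are passed in as plain callables so that no real API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical marker for the missing answer in a Hard Mode statement.
PLACEHOLDER = "ANSWER"


@dataclass
class Result:
    answer: Optional[str]          # discovered answer, if reflection accepted one
    easy_statement: Optional[str]  # Hard Mode statement with the answer filled in
    proof: Optional[str]           # proof returned by the prover, or None


def discover_and_prove(
    hard_statement: str,
    propose: Callable[[str, Optional[str]], str],   # stand-in for LLM reasoning
    reflect: Callable[[str, str], Optional[str]],   # stand-in for self-reflection
    prove: Callable[[str], Optional[str]],          # stand-in for an existing prover
    max_rounds: int = 3,                            # assumed retry budget
) -> Result:
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # Step 1: discover a candidate answer in natural language.
        answer = propose(hard_statement, feedback)
        # Step 2: self-reflect; a critique of None means the answer is accepted.
        feedback = reflect(hard_statement, answer)
        if feedback is None:
            # Step 3: rewrite Hard Mode into Easy Mode by filling in the answer.
            easy = hard_statement.replace(PLACEHOLDER, answer)
            # Step 4: hand the Easy Mode statement to the prover.
            return Result(answer, easy, prove(easy))
    # No candidate survived reflection within the budget.
    return Result(None, None, None)
```

Separating discovery from proving in this way means the prover component can be swapped without touching the answer-discovery loop, which matches the summary's point that DAP reuses existing ATP provers.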

In practice

Use MiniF2F-Hard for Hard Mode ATP evaluation.
Apply LLM self-reflection for complex reasoning tasks.

Topics

Automated Theorem Proving
Hard Mode Benchmarks
Discover and Prove
Large Language Models
Lean 4

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.