OProver: A Unified Framework for Agentic Formal Theorem Proving

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

OProver is a unified framework for agentic formal theorem proving in Lean 4 that builds iterative proof revision into the training process rather than treating it as an inference-time heuristic. To repair a failed proof attempt, the model conditions on retrieved compiler-verified proofs and on feedback from the Lean compiler. Training begins with continued pretraining on Lean code and mathematics. It is followed by iterative post-training that combines agentic proving, supervised fine-tuning (SFT) on repair trajectories, and reinforcement learning (RL) on cases that remain unresolved. OProver is paired with OProofs, a large-scale corpus of 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized proving trajectories that record the retrieved context, failed attempts, compiler feedback, and repairs. OProver-32B achieves state-of-the-art Pass@32 scores on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), ranks second on MathOlympiad (22.8%) and ProofNet (33.2%), and outperforms prior open-weight whole-proof provers.
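
The summary reports Pass@32 without defining it. Pass@k is usually read as the probability that at least one of k sampled proofs is accepted by the Lean compiler. The sketch below uses the common unbiased estimator, which draws n ≥ k samples per statement and counts the c that verify. The function names are hypothetical, and whether OProver's evaluation samples exactly 32 proofs per statement is an assumption here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one statement.

    n: number of sampled proof attempts
    c: number of attempts the compiler accepted
    k: sampling budget being evaluated (e.g. 32)
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if k > n:
        raise ValueError("need k <= n")
    # If fewer than k attempts failed, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 - P(all k attempts drawn without replacement are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark statement."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When n equals k (for example, exactly 32 samples), the estimate collapses to 1 if any attempt verified and 0 otherwise, so the benchmark score is simply the fraction of statements with at least one verified proof.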

Key takeaway

For AI scientists and machine learning engineers developing formal theorem provers, OProver shows that building multi-round, feedback-conditioned refinement into the training loop, rather than adding it as a post-hoc augmentation, yields stronger provers. Prioritize systems that learn from iterative repair trajectories and compiler feedback; in OProver's results, this approach substantially raised success rates across diverse mathematical benchmarks. Consider also maintaining an evolving corpus that grows as newly verified proofs are added, so the training data keeps pace with the prover's capabilities.
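
Learning from repair trajectories requires turning each recorded trajectory into training pairs. The sketch below shows one hypothetical way to do that: the `Round` and `Trajectory` records and the `-- ` section markers in the prompt are illustrative assumptions, not the OProofs serialization format. It pairs each failed attempt (with its compiler feedback and retrieved context) with the attempt that followed it, and skips unresolved trajectories, which in OProver's recipe are the cases handed to RL.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Round:
    """One revision round: a proof attempt and the compiler's verdict on it."""
    proof: str
    feedback: str  # compiler messages; empty if the attempt was accepted
    verified: bool


@dataclass
class Trajectory:
    """Hypothetical record of one statement's multi-round proving trajectory."""
    statement: str
    retrieved: List[str]  # compiler-verified proofs retrieved as context
    rounds: List[Round] = field(default_factory=list)


def to_sft_examples(traj: Trajectory) -> List[Tuple[str, str]]:
    """Turn a resolved trajectory into (prompt, target) repair pairs."""
    if not traj.rounds or not traj.rounds[-1].verified:
        return []  # unresolved: no verified endpoint to imitate (a candidate for RL)
    context = "\n\n".join(f"-- retrieved proof\n{p}" for p in traj.retrieved)
    examples = []
    for prev, nxt in zip(traj.rounds, traj.rounds[1:]):
        if prev.verified:
            continue  # only failed attempts become repair prompts
        prompt = (
            f"{context}\n\n-- statement\n{traj.statement}\n\n"
            f"-- failed attempt\n{prev.proof}\n\n"
            f"-- compiler feedback\n{prev.feedback}"
        )
        examples.append((prompt, nxt.proof))
    return examples
```

Pairing every failure with its immediate successor mirrors the trajectory step by step; a stricter alternative keeps only the final step that produced the verified proof, trading data volume for cleaner targets.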

Key insights

Integrating agentic proving and compiler feedback directly into training significantly enhances formal theorem proving performance.

Principles

Method

OProver trains a policy to iteratively revise proofs using retrieved context and Lean 4 compiler feedback. Training combines continued pretraining, SFT on repair trajectories, and RL on hard, unresolved cases, and each newly verified proof is added back to the corpus.
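
The Method describes a loop: retrieve verified proofs, propose a proof, compile it, revise using the compiler's feedback, and add any newly verified proof to the corpus. The sketch below is a minimal, hypothetical rendering of that loop. The policy and compiler are injected callables (in practice an LLM and the Lean 4 compiler); the toy stubs, the word-overlap retriever, `max_rounds=4`, and the SFT/RL split by outcome are illustrative assumptions, not OProver's implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces; a real system would wrap an LLM and the Lean 4 compiler.
Policy = Callable[[str, List[str], Optional[str], Optional[str]], str]
Compiler = Callable[[str, str], Tuple[bool, str]]
Step = Tuple[str, str, bool]  # (proof, feedback, verified)


def retrieve(statement: str, corpus: Dict[str, str], top_k: int = 2) -> List[str]:
    """Toy retriever: rank verified proofs by words shared with the statement."""
    words = set(statement.split())
    ranked = sorted(corpus.items(), key=lambda kv: -len(words & set(kv[0].split())))
    return [proof for _, proof in ranked[:top_k]]


def prove(statement: str, corpus: Dict[str, str], policy: Policy,
          compile_lean: Compiler, max_rounds: int = 4) -> Tuple[Optional[str], List[Step]]:
    """Propose, compile, and revise until the proof verifies or rounds run out."""
    context = retrieve(statement, corpus)
    proof: Optional[str] = None
    feedback: Optional[str] = None
    trajectory: List[Step] = []
    for _ in range(max_rounds):
        proof = policy(statement, context, proof, feedback)
        ok, feedback = compile_lean(statement, proof)
        trajectory.append((proof, feedback, ok))
        if ok:
            corpus[statement] = proof  # newly verified proof expands the corpus
            return proof, trajectory
    return None, trajectory


def run_round(statements: List[str], corpus: Dict[str, str], policy: Policy,
              compile_lean: Compiler):
    """One pass: resolved trajectories feed SFT, unresolved statements feed RL."""
    sft_pool, rl_pool = [], []
    for s in statements:
        proof, traj = prove(s, corpus, policy, compile_lean)
        (sft_pool if proof is not None else rl_pool).append((s, traj))
    return sft_pool, rl_pool


# Toy stand-ins so the sketch runs without Lean.
def toy_policy(statement, context, prev_proof, feedback):
    return "attempt" if prev_proof is None else prev_proof + " fix"


def toy_compile(statement, proof):
    if "hard" in statement:
        return False, "toy error: rejected"
    ok = proof.endswith("fix")
    return ok, ("" if ok else "toy error: rejected")


if __name__ == "__main__":
    corpus = {"easy lemma": "proof0"}
    sft, rl = run_round(["easy goal", "hard goal"], corpus, toy_policy, toy_compile)
    print(len(sft), len(rl), sorted(corpus))  # 1 1 ['easy goal', 'easy lemma']
```

Injecting the policy and compiler as callables keeps the loop testable without a Lean toolchain. Routing resolved trajectories to SFT and unresolved statements to RL follows the summary's description; how OProver actually schedules these stages is not specified here.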

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.