Process-Verified Reinforcement Learning for Theorem Proving via Lean
Summary
This work demonstrates that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training for reinforcement learning from verifiable rewards (RLVR). Proof attempts are parsed into tactic sequences, with Lean marking locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. These structured rewards are incorporated into a GRPO-style reinforcement learning objective using first-error propagation and first-token credit methods. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. This approach positions symbolic proof assistants as process-level reward oracles during training, combining the scalability of language models with the reliability of symbolic verification for formal reasoning.
Key takeaway
AI scientists and machine learning engineers developing automated theorem provers should consider integrating process-level feedback from symbolic proof assistants such as Lean. In the reported experiments, tactic-level supervision with first-error propagation outperforms outcome-only training in most settings on benchmarks such as MiniF2F and ProofNet. These fine-grained verification signals offer a way to improve the reliability of formal reasoning systems while keeping the scalability of language models.
Key insights
The Lean proof assistant provides fine-grained, process-level verified feedback for reinforcement learning in theorem proving.
Principles
- Symbolic proof assistants offer dense, sound feedback.
- Tactic-level supervision outperforms outcome-only training in most reported settings.
- Combine LM scalability with symbolic reliability.
Method
Proof attempts are parsed into tactic sequences; Lean marks sound steps and the first error. These signals are incorporated into a GRPO-style RL objective with first-error propagation and first-token credit.
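The reward shaping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that first-error propagation means the failing tactic and every tactic after it share the penalty, and that first-token credit means each tactic's reward is placed on that tactic's first token. The function names, the reward values of +1 and -1, and the representation of Lean's feedback as a single index are all hypothetical.

```python
from typing import List, Optional


def tactic_rewards(num_tactics: int,
                   first_error: Optional[int],
                   ok: float = 1.0,
                   bad: float = -1.0) -> List[float]:
    """Per-tactic rewards from verifier feedback.

    `first_error` is the index of the earliest failing tactic, or None
    if every tactic checked. Tactics before the error get `ok`; the
    failing tactic and all later ones get `bad` (an assumed reading of
    first-error propagation).
    """
    if first_error is None:
        return [ok] * num_tactics
    if not 0 <= first_error < num_tactics:
        raise ValueError("first_error is out of range")
    return [ok] * first_error + [bad] * (num_tactics - first_error)


def first_token_credit(tokens_per_tactic: List[int],
                       rewards: List[float]) -> List[float]:
    """Spread per-tactic rewards over tokens.

    Each tactic's reward lands on its first token; the remaining
    tokens of that tactic get 0 (an assumed reading of first-token
    credit).
    """
    if len(tokens_per_tactic) != len(rewards):
        raise ValueError("one reward is required per tactic")
    per_token: List[float] = []
    for count, reward in zip(tokens_per_tactic, rewards):
        if count <= 0:
            raise ValueError("each tactic needs at least one token")
        per_token.extend([reward] + [0.0] * (count - 1))
    return per_token


# A three-tactic proof attempt whose second tactic (index 1) fails.
step_rewards = tactic_rewards(num_tactics=3, first_error=1)
token_rewards = first_token_credit([2, 1, 3], step_rewards)
print(step_rewards)   # [1.0, -1.0, -1.0]
print(token_rewards)  # [1.0, 0.0, -1.0, -1.0, 0.0, 0.0]
```

Under these assumptions, the tactic that precedes the error keeps positive credit even though the proof as a whole fails. An outcome-only reward cannot make that distinction.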
In practice
- Use Lean as a process-level reward oracle.
- Apply first-error propagation in RL objectives.
- Integrate tactic-level feedback for proof generation.
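For the outcome-level side of a GRPO-style objective, the central step is normalizing each sampled proof's reward against the other samples drawn for the same theorem. The sketch below shows only that group-relative normalization. The function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, and how the paper combines this term with tactic-level signals is not specified here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` guards against division by zero when all rewards are equal.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled proofs for one theorem: 1.0 if Lean accepts the proof, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # roughly [1, -1, -1, 1]
```

When every sample in a group receives the same outcome reward, all advantages collapse to zero and the group contributes no gradient. Denser tactic-level signals are one way to still distinguish such samples.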
Topics
- Reinforcement Learning
- Theorem Proving
- Lean Proof Assistant
- Formal Verification
- Language Models
- Tactic-Level Feedback
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.