Process-Verified Reinforcement Learning for Theorem Proving via Lean

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

This work shows that the Lean proof assistant can itself serve as a symbolic process oracle for reinforcement learning from verifiable rewards (RLVR), supplying both outcome-level and fine-grained, tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean marks the locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. These structured rewards are incorporated into a GRPO-style reinforcement learning objective using first-error propagation and first-token credit. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, with improvements on benchmarks such as MiniF2F and ProofNet. The approach positions symbolic proof assistants as process-level reward oracles during training, combining the scalability of language models with the reliability of symbolic verification for formal reasoning.
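
As a small illustration of what "locally sound steps and the earliest failing step" can look like, consider the deliberately broken Lean 4 attempt below. The example and the way it is split into three steps are assumptions made here for illustration; the summary does not say how the original work segments tactics.

```lean
-- Illustrative only: a three-step attempt whose last step fails on purpose.
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  constructor   -- step 1: checks; splits the goal into `p` and `q`
  · exact hp    -- step 2: checks; closes the goal `p`
  · exact hp    -- step 3: first error; `hp : p` does not match the goal `q`
```

An outcome-only reward would record this attempt simply as a failure. A tactic-level signal can additionally record that steps 1 and 2 were accepted and that step 3 is where the proof first breaks.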

Key takeaway

AI scientists and machine learning engineers developing automated theorem provers should consider integrating process-level feedback from symbolic proof assistants such as Lean. In the reported experiments, tactic-level supervision with first-error propagation outperforms outcome-only training in most settings on benchmarks such as MiniF2F and ProofNet. These fine-grained verification signals offer a way to improve the reliability of formal reasoning systems while keeping the scalability of language models.

Key insights

The Lean proof assistant provides fine-grained, process-level verified feedback for reinforcement learning in theorem proving.

Method

Proof attempts are parsed into tactic sequences; Lean marks sound steps and the first error. These signals are incorporated into a GRPO-style RL objective with first-error propagation and first-token credit.
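
The sketch below shows one way such signals could be turned into per-token rewards. It is not the authors' implementation. The `Tactic` structure, the `bonus` and `penalty` values, and the two modes are hypothetical, and the verifier verdicts are hard-coded rather than obtained from Lean. It assumes that "first-token credit" places each tactic's reward on that tactic's first token, and that "first-error propagation" spreads the penalty over every token from the first failing tactic onward. The group-relative normalization is the standard GRPO-style form, applied here to outcome rewards only.

```python
import statistics
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tactic:
    """One parsed tactic (hypothetical structure for this sketch)."""
    text: str
    num_tokens: int  # tokens the tactic spans; assumed to be >= 1
    ok: bool         # verdict a verifier such as Lean would supply


def first_error(tactics: List[Tactic]) -> Optional[int]:
    """Index of the earliest failing tactic, or None if all steps checked."""
    for i, tactic in enumerate(tactics):
        if not tactic.ok:
            return i
    return None


def process_rewards(tactics: List[Tactic], mode: str = "first_token",
                    bonus: float = 0.1, penalty: float = -1.0) -> List[float]:
    """Per-token process rewards for one proof attempt.

    Verified tactics before the first error get `bonus` on their first token.
    mode="first_token": the penalty lands on the failing tactic's first token.
    mode="propagate":   the penalty covers every token from the first failing
                        tactic to the end of the attempt.
    Verdicts of tactics after the first error are ignored in both modes.
    """
    err = first_error(tactics)
    rewards: List[float] = []
    for i, tactic in enumerate(tactics):
        step = [0.0] * tactic.num_tokens
        if err is None or i < err:
            step[0] = bonus
        elif mode == "propagate":
            step = [penalty] * tactic.num_tokens
        elif i == err:
            step[0] = penalty
        rewards.extend(step)
    return rewards


def group_advantages(outcomes: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: outcome rewards normalized within a sample group."""
    mean = statistics.fmean(outcomes)
    std = statistics.pstdev(outcomes)
    return [(r - mean) / (std + eps) for r in outcomes]


# A four-tactic attempt whose third tactic is the first to fail.
attempt = [
    Tactic("constructor", 2, True),
    Tactic("exact hp", 3, True),
    Tactic("exact hp", 3, False),
    Tactic("simp", 1, False),
]
first_token_rewards = process_rewards(attempt, mode="first_token")
propagated_rewards = process_rewards(attempt, mode="propagate")

# Outcome rewards for a group of four sampled proofs (1.0 = fully verified).
advantages = group_advantages([1.0, 0.0, 0.0, 0.0])
```

How the per-token process rewards are combined with the group-level advantages inside the training objective is left open here, since the summary does not specify it.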

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.