Evaluation of LLMs for Mathematical Formalization in Lean
Summary
A recent evaluation assesses how well various Large Language Models (LLMs) generate formal mathematical proofs in the Lean 4 theorem prover. The study, published on 2026-06-04, benchmarked performance with both pass@k and refine@k metrics on subsets of the miniF2F and miniCTX datasets. Gemini 3.1 Pro and Claude Opus 4.7 had the highest overall success rates: Gemini 3.1 Pro reached 92% on miniF2F with refine@32, and Claude Opus 4.7 reached 86% on miniCTX with refine@32. On cost-efficiency, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most economical options, delivering competitive accuracy at an average cost of under $0.01 per correct proof.
Key takeaway
AI Scientists and Machine Learning Engineers developing formal mathematical proofs in Lean 4 should evaluate Gemini 3.1 Pro or Claude Opus 4.7 first for their superior accuracy. If cost is a primary concern, consider NVIDIA Nemotron 3 Super or GPT-OSS 120B, which offer competitive accuracy at under $0.01 per correct proof. Use refine@k strategies to improve proof generation success rates.
Key insights
Gemini 3.1 Pro and Claude Opus 4.7 are highly effective at generating formal Lean 4 proofs, while NVIDIA Nemotron 3 Super and GPT-OSS 120B offer competitive accuracy at a lower cost per correct proof.
Principles
- LLM performance in formal proof generation varies significantly across models.
- Refinement strategies improve LLM proof success rates.
- Cost-effectiveness is a key differentiator for LLM selection.
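The cost-effectiveness principle can be made concrete with a small calculation. The summary does not say how the study computed its cost figures, so the sketch below assumes the simplest definition: total spend on a benchmark run divided by the number of proofs Lean accepted. The function name and its parameters are hypothetical.

```python
def cost_per_correct_proof(total_cost_usd: float, correct_proofs: int) -> float:
    """Average spend per verified proof (assumed definition, not the study's).

    total_cost_usd: total API or compute spend for the whole run, in USD.
    correct_proofs: number of generated proofs that the proof checker accepted.
    """
    if total_cost_usd < 0:
        raise ValueError("total_cost_usd must be non-negative")
    if correct_proofs <= 0:
        # No verified proofs: the cost per correct proof is unbounded.
        return float("inf")
    return total_cost_usd / correct_proofs
```

Under this definition, failed attempts still count toward the numerator, so a cheap model with a low success rate can end up costing more per correct proof than a pricier, more accurate one.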
Method
The study benchmarked LLMs using pass@k and refine@k metrics on subsets of the miniF2F and miniCTX datasets to compare formal proof generation in Lean 4.
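For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used in code-generation benchmarks: given n sampled proofs for a problem, of which c are verified, it estimates the probability that at least one of k randomly chosen samples is correct. Whether the study used this estimator or simply drew k samples per problem is not stated in the summary, so treat it as an illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: number of proofs sampled for one problem.
    c: number of those proofs that were verified as correct.
    k: sample budget being evaluated (must satisfy 1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failures means every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # Otherwise subtract the probability that all k draws are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all problems in the evaluated subset.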
In practice
- Use Gemini 3.1 Pro or Claude Opus 4.7 for high-accuracy Lean 4 proofs.
- Consider NVIDIA Nemotron 3 Super or GPT-OSS 120B for cost-optimized proof generation.
- Apply refine@k strategies to boost LLM proof success.
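A refine@k strategy can be sketched as a loop that retries with feedback. The summary does not define refine@k precisely, so this sketch assumes it means up to k sequential attempts in which each retry sees the previous proof and the checker's error message. Both `generate` (a call to the model) and `check` (a call to a Lean 4 checker) are hypothetical callables supplied by the caller, not part of any real library.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions for illustration):
#   generate(statement, previous_proof, feedback) -> candidate proof text
#   check(proof) -> (accepted, checker_message)
Generate = Callable[[str, Optional[str], Optional[str]], str]
Check = Callable[[str], Tuple[bool, str]]


def refine_at_k(generate: Generate, check: Check, statement: str, k: int) -> Tuple[bool, int]:
    """Try up to k attempts, feeding checker errors back into the next attempt.

    Returns (solved, attempts_used).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    previous_proof: Optional[str] = None
    feedback: Optional[str] = None
    for attempt in range(1, k + 1):
        # The first attempt has no feedback; later ones see the last failure.
        proof = generate(statement, previous_proof, feedback)
        accepted, message = check(proof)
        if accepted:
            return True, attempt
        previous_proof, feedback = proof, message
    return False, k
```

Unlike pass@k, where the k samples are independent, each attempt here depends on the one before it, which is why the two metrics are reported separately.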
Topics
- Large Language Models
- Lean 4 Theorem Prover
- Formal Proof Generation
- Model Evaluation
- Cost-Efficient AI
- Gemini 3.1 Pro
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.