Evaluation of LLMs for Mathematical Formalization in Lean

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A recent evaluation assesses how well various large language models (LLMs) generate formal mathematical proofs in the Lean 4 theorem prover. The study, published on 2026-06-04, benchmarked the models with pass@k and refine@k metrics on subsets of the miniF2F and miniCTX datasets. Gemini 3.1 Pro and Claude Opus 4.7 achieved the highest overall success rates: Gemini 3.1 Pro reached 92% on miniF2F with refine@32, and Claude Opus 4.7 reached 86% on miniCTX with refine@32. On cost-efficiency, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most economical options, delivering competitive accuracy at an average cost of under $0.01 per correct proof.

Key takeaway

AI scientists and machine learning engineers developing formal mathematical proofs in Lean 4 should evaluate Gemini 3.1 Pro or Claude Opus 4.7 first for their higher accuracy. If cost is the primary concern, consider NVIDIA Nemotron 3 Super or GPT-OSS 120B, which offer competitive performance at under $0.01 per correct proof. Use refine@k strategies to improve proof generation success rates.

Key insights

Gemini 3.1 Pro and Claude Opus 4.7 were the most accurate models evaluated for generating formal Lean 4 proofs, while NVIDIA Nemotron 3 Super and GPT-OSS 120B delivered competitive accuracy at under $0.01 per correct proof.

Principles

LLM performance in formal proof generation varies significantly.
Refinement strategies improve LLM proof success rates.
Cost-effectiveness is a key differentiator for LLM selection.
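
The cost comparison can be made concrete with a small calculation. The sketch below assumes cost per correct proof is total API spend divided by the number of proofs Lean accepted; this summary does not say how the study computed its figure, and the function name, parameters, and prices are hypothetical placeholders, not values from the evaluation.

```python
def cost_per_correct_proof(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    correct_proofs: int,
) -> float:
    """Total spend in USD divided by the number of accepted proofs.

    Assumption: failed attempts are still billed, so their tokens are
    included in input_tokens and output_tokens.
    """
    if correct_proofs <= 0:
        raise ValueError("need at least one correct proof")
    total_usd = (
        input_tokens / 1_000_000 * usd_per_million_input
        + output_tokens / 1_000_000 * usd_per_million_output
    )
    return total_usd / correct_proofs


# Hypothetical run: 2M input tokens, 1M output tokens, 200 accepted proofs.
example_cost = cost_per_correct_proof(2_000_000, 1_000_000, 0.5, 1.0, 200)
```

Because failed attempts count toward spend, a cheaper model that needs more attempts per theorem can still cost more per correct proof than a stronger one.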

Method

The study benchmarked LLMs using pass@k and refine@k metrics on miniF2F and miniCTX datasets to compare formal proof generation in Lean 4.
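
This summary does not state how the study computed pass@k. A common choice is the unbiased estimator: for a problem with n sampled proofs of which c are accepted, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch under that assumption, with illustrative function names:

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n attempts with c correct, is correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per theorem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no results given")
    return sum(scores) / len(scores)
```

For example, a theorem with 4 samples and 1 accepted proof contributes 0.25 to pass@1 and 0.5 to pass@2.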

In practice

Use Gemini 3.1 Pro or Claude Opus 4.7 for high-accuracy Lean 4 proofs.
Consider NVIDIA Nemotron 3 Super or GPT-OSS 120B for cost-optimized proof generation.
Apply refine@k strategies to boost LLM proof success.
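
A refine@k strategy can be sketched as a feedback loop. The sketch below assumes refine@k means up to k sequential attempts, each seeing the previous proof and the checker's error output; the summary does not define the metric precisely. The `generate` and `check` callables are hypothetical stand-ins for an LLM call and a Lean 4 compile step, and the toy versions exist only to make the example runnable.

```python
from typing import Callable, Optional, Tuple

# generate(statement, previous_proof, feedback) -> candidate proof text
Generate = Callable[[str, Optional[str], Optional[str]], str]
# check(statement, proof) -> (accepted, error message or "")
Check = Callable[[str, str], Tuple[bool, str]]


def refine_at_k(statement: str, generate: Generate, check: Check, k: int) -> Tuple[bool, int]:
    """Try up to k attempts, feeding checker errors back to the generator.

    Returns (solved, attempts_used).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    previous: Optional[str] = None
    feedback: Optional[str] = None
    for attempt in range(1, k + 1):
        proof = generate(statement, previous, feedback)
        accepted, message = check(statement, proof)
        if accepted:
            return True, attempt
        previous, feedback = proof, message
    return False, k


def toy_generate(statement: str, previous: Optional[str], feedback: Optional[str]) -> str:
    # Toy stand-in for an LLM: the first try is wrong, the retry uses the feedback.
    return "by simp" if feedback is None else "by linarith"


def toy_check(statement: str, proof: str) -> Tuple[bool, str]:
    # Toy stand-in for compiling with Lean 4; accepts exactly one proof text.
    if proof == "by linarith":
        return True, ""
    return False, "simp made no progress"
```

With the toy functions, `refine_at_k("x = 2", toy_generate, toy_check, 32)` succeeds on the second attempt, whereas a budget of k = 1 fails because no feedback is ever used.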

Topics

Large Language Models
Lean 4 Theorem Prover
Formal Proof Generation
Model Evaluation
Cost-Efficient AI
Gemini 3.1 Pro

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.