Evaluation of LLMs for Mathematical Formalization in Lean

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

This study evaluates several large language models (LLMs) on formal mathematical proof generation in the Lean 4 theorem prover. The researchers compared models using pass@k and refine@k metrics on 50-problem subsets of the miniF2F and miniCTX datasets. Gemini 3.1 Pro and Claude Opus 4.7 were the top performers: Gemini 3.1 Pro reached 92% success on miniF2F with refine@32, and Claude Opus 4.7 reached 86% on miniCTX with refine@32. On cost, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, at less than $0.01 per correct proof. Iterative refinement generally improved performance, by an average of 3.14% across models, which points to the value of verifier feedback loops.

Key takeaway

Research scientists building formal proof generation systems should prioritize Gemini 3.1 Pro or Claude Opus 4.7 for their higher accuracy in Lean 4. If budget is a constraint, NVIDIA Nemotron 3 Super and GPT-OSS 120B offer competitive performance at a much lower cost per correct proof. Iterative refinement is worth implementing: feeding Lean compiler errors back to the model generally improved proof success in this evaluation.
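
The budget comparison rests on cost per correct proof, which can be read as total spend divided by the number of verified proofs. The sketch below shows that arithmetic under this assumed definition; the function name and the example figures are hypothetical and are not taken from the study.

```python
def cost_per_correct_proof(total_cost_usd: float, num_correct: int) -> float:
    """Total spend divided by the number of proofs the verifier accepted.

    Assumed definition for illustration; the study may account for
    cost differently (for example, per attempt or per token).
    """
    if num_correct <= 0:
        raise ValueError("cost per correct proof is undefined with no correct proofs")
    return total_cost_usd / num_correct


# Hypothetical figures, not results from the study:
# a run that cost 0.40 USD in total and produced 45 verified proofs.
example = cost_per_correct_proof(total_cost_usd=0.40, num_correct=45)
```

With these made-up inputs the result is a little under one cent per proof. A model that solves fewer problems can still score well on this metric if its total spend is low enough, which is why accuracy and cost are reported separately.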

Key insights

Iterative refinement significantly boosts LLM performance in formal mathematical proof generation.

Method

The study used a standardized zero-shot prompting paradigm with a temperature of 0.5. It employed pass@k (32 independent generations) and refine@k (up to 32 iterative refinements with Lean verifier feedback) for evaluation.
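
The difference between the two metrics can be sketched as two loops over a generator and a verifier. This is a minimal illustration, not the study's harness: `generate` and `verify` are hypothetical callables standing in for an LLM call and a Lean 4 check, and passing only the most recent error message back as feedback is an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the study):
#   generate(problem, feedback) returns a candidate proof as text
#   verify(problem, proof) returns (accepted, error_message)
Generate = Callable[[str, Optional[str]], str]
Verify = Callable[[str, str], Tuple[bool, str]]


def pass_at_k(problem: str, generate: Generate, verify: Verify, k: int = 32) -> bool:
    """Solved if any of k independent generations is accepted."""
    for _ in range(k):
        proof = generate(problem, None)  # no feedback: attempts are independent
        accepted, _error = verify(problem, proof)
        if accepted:
            return True
    return False


def refine_at_k(problem: str, generate: Generate, verify: Verify, k: int = 32) -> bool:
    """Solved if any of up to k sequential attempts is accepted,
    where each attempt sees the verifier error from the previous one."""
    feedback: Optional[str] = None
    for _ in range(k):
        proof = generate(problem, feedback)
        accepted, error = verify(problem, proof)
        if accepted:
            return True
        feedback = error  # assumption: only the latest error is fed back
    return False


def success_rate(problems: Iterable[str], solve: Callable[[str], bool]) -> float:
    """Fraction of problems for which solve(problem) returns True."""
    results = [solve(p) for p in problems]
    return sum(results) / len(results)
```

Both loops spend at most k model calls per problem, so the comparison isolates the effect of the feedback. Whether the first attempt counts toward k in refine@k is a convention this sketch assumes; the study may count it differently.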

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.