TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics
Summary
TheoremBench is a new Lean4 benchmark for evaluating Large Language Models (LLMs) on theorem proving in formal mathematics, moving beyond competition-style problems. Built from nearly one hundred classical theorems, it comes in two versions: a plain version with one target theorem per instance, and a premised version that adds automatically extracted supporting subtheorems. This dual design makes it possible to assess both complete proofs of the target theorem and partial progress through the proof's internal structure. Experiments show that explicit premises substantially improve performance for Lean4-capable prover models. The benchmark also introduces theorem-level coverage and token-efficiency metrics, which reveal that current provers favor easy subtheorems and often rely on long, inefficient tactic traces instead of compact proof plans. Together, these give a more fine-grained view of formal reasoning ability than a single pass/fail score.
Key takeaway
For AI scientists developing or evaluating LLMs for formal mathematics: traditional competition-style benchmarks may not capture true reasoning ability. Consider adopting structured benchmarks like TheoremBench, especially its premised version, to get a more accurate, fine-grained view of model performance. Prioritize models that generate compact proof plans rather than long tactic traces, since the benchmark's token-efficiency results point to long traces as a current weakness in formal proving.
Key insights
TheoremBench evaluates LLMs on formal theorem proving and reveals two things: explicit premises improve prover performance, and current provers are inefficient in how they spend tokens.
Principles
- Structural benchmark design improves LLM evaluation.
- Explicit premises enhance formal proving performance.
- Prover efficiency requires compact proof plans.
Method
TheoremBench constructs a benchmark from classical theorems, offering plain and premised versions with automatically extracted subtheorems to evaluate full and partial proof progress.
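To make the plain and premised versions concrete, here is a minimal Lean 4 sketch of what one instance might look like in each form. The statement, the theorem names (`toy_target_plain`, `toy_premise_base`, `toy_premise_step`, `toy_target_premised`), and the file layout are all assumptions chosen for illustration; they are not taken from TheoremBench, and the real benchmark's format may differ. Each `sorry` marks a proof the model would be asked to supply.

```lean
-- Hypothetical toy instance; names and layout are illustrative,
-- not taken from TheoremBench.

-- Plain version: a single target theorem, proof left to the model.
theorem toy_target_plain (n : Nat) : 0 + n = n := by
  sorry

-- Premised version: supporting subtheorems are stated first.
-- Each one the model proves can be counted as partial progress.
theorem toy_premise_base : 0 + 0 = 0 := by
  sorry

theorem toy_premise_step (k : Nat) (ih : 0 + k = k) :
    0 + (k + 1) = k + 1 := by
  sorry

-- The target theorem can then be assembled from the premises.
theorem toy_target_premised (n : Nat) : 0 + n = n := by
  induction n with
  | zero => exact toy_premise_base
  | succ k ih => exact toy_premise_step k ih
```

In this sketch the premised form exposes the proof's internal structure, so a model that proves only `toy_premise_base` still registers measurable progress, whereas the plain form yields only a pass or fail on the target.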
In practice
- Use premised benchmarks for LLM prover training.
- Focus LLM development on compact proof plans.
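The summary mentions theorem-level coverage and token-efficiency metrics without giving their definitions. The Python sketch below shows one plausible way such quantities could be computed; the `Attempt` record, the function names, and both formulas (fraction of items proved per theorem, and proofs per 1,000 generated tokens) are assumptions made for illustration, not the benchmark's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One proof attempt (hypothetical record layout)."""
    theorem: str   # parent classical theorem
    item: str      # target theorem or subtheorem name
    proved: bool   # whether the proof checker accepted the proof
    tokens: int    # tokens the model generated for this attempt


def coverage_by_theorem(attempts):
    """Assumed definition: fraction of items proved, grouped by parent theorem."""
    totals, proved = {}, {}
    for a in attempts:
        totals[a.theorem] = totals.get(a.theorem, 0) + 1
        proved[a.theorem] = proved.get(a.theorem, 0) + int(a.proved)
    return {t: proved[t] / totals[t] for t in totals}


def proofs_per_1k_tokens(attempts):
    """Assumed definition: accepted proofs per 1,000 generated tokens."""
    total_tokens = sum(a.tokens for a in attempts)
    if total_tokens == 0:
        return 0.0
    return 1000 * sum(a.proved for a in attempts) / total_tokens
```

Under these assumed definitions, a prover that closes the easy subtheorems but burns thousands of tokens failing on the main theorem would show moderate coverage and low token efficiency, which is the kind of pattern the summary describes.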
Topics
- LLM Evaluation
- Theorem Proving
- Formal Mathematics
- Lean4
- Benchmark Design
- Proof Automation
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.