TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

2026-06-08 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Research Methodology & Innovation · Depth: Expert, quick

Summary

TheoremBench is a new Lean4 benchmark for evaluating Large Language Models (LLMs) on theorem proving in formal mathematics, moving beyond competition-style problems. Built from nearly one hundred classical theorems, it comes in two versions: a plain version with one target theorem per instance, and a premised version that adds automatically extracted supporting subtheorems. This dual design measures both whether the final theorem is proved and how much partial progress a model makes through the proof's internal structure. Experiments show that explicit premises substantially improve performance for Lean4-capable prover models. The benchmark also introduces theorem-level coverage and token-efficiency metrics, which show that current provers favor easy subtheorems and often rely on long, inefficient tactic traces instead of compact proof plans. Together, these give a finer-grained view of formal reasoning ability than a single pass/fail score.
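As a rough illustration of what theorem-level coverage and token efficiency could measure, the sketch below computes both from a list of proof attempts. The summary does not give the paper's exact formulas, so everything here is an assumption: the `Attempt` record layout, coverage defined as the fraction of statements proved within each parent theorem, and efficiency defined as accepted proofs per 1,000 generated tokens. The sample data is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One proof attempt on one statement (hypothetical record layout)."""
    theorem: str    # parent classical theorem
    statement: str  # the target theorem or one of its supporting subtheorems
    proved: bool    # whether the Lean checker accepted the proof
    tokens: int     # tokens the model generated for this attempt


def coverage_by_theorem(attempts):
    """Assumed definition: fraction of statements proved per parent theorem."""
    totals, solved = {}, {}
    for a in attempts:
        totals[a.theorem] = totals.get(a.theorem, 0) + 1
        solved[a.theorem] = solved.get(a.theorem, 0) + int(a.proved)
    return {t: solved[t] / totals[t] for t in totals}


def proofs_per_kilotoken(attempts):
    """Assumed definition: accepted proofs per 1,000 generated tokens."""
    total_tokens = sum(a.tokens for a in attempts)
    if total_tokens == 0:
        return 0.0
    return 1000 * sum(a.proved for a in attempts) / total_tokens


# Invented sample data: the easy subtheorems of T1 are proved cheaply,
# while the main theorem consumes many tokens and still fails.
attempts = [
    Attempt("T1", "T1.sub1", True, 400),
    Attempt("T1", "T1.sub2", True, 600),
    Attempt("T1", "T1.main", False, 3000),
    Attempt("T2", "T2.main", True, 1000),
]

print(coverage_by_theorem(attempts))
print(proofs_per_kilotoken(attempts))
```

In this toy data, a prover that clears only the easy subtheorems still earns partial coverage on T1, while the long failed attempt on the main theorem pulls the efficiency figure down. That is the kind of pattern the summary says the metrics expose.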

Key takeaway

AI scientists who develop or evaluate LLMs for formal mathematics should note that competition-style benchmarks may not capture how well a model proves classical theorems with real internal structure. Consider adopting structured benchmarks such as TheoremBench, especially its premised version, for a more accurate, fine-grained view of model performance. Prioritize models that generate compact proof plans over long tactic traces, since the benchmark's token-efficiency results identify long traces as a weakness of current provers.

Key insights

TheoremBench evaluates LLMs on formal theorem proving and shows that explicit premises improve performance, while current provers remain token-inefficient.

Principles

Structural benchmark design improves LLM evaluation.
Explicit premises enhance formal proving performance.
Prover efficiency requires compact proof plans.

Method

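TheoremBench is built from nearly one hundred classical theorems and comes in two versions: a plain version with one target theorem per instance, and a premised version that adds automatically extracted supporting subtheorems. Together they allow evaluation of both full proofs and partial proof progress.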
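To make the plain/premised distinction concrete, here is a toy Lean 4 sketch. It is not taken from the benchmark: the statements are deliberately trivial, all theorem names are hypothetical, and the actual instance format used by TheoremBench may differ. Each `sorry` marks a proof the model would be asked to supply.

```lean
-- Plain version (toy example): a single target theorem per instance.
theorem add_comm_plain (a b : Nat) : a + b = b + a := by
  sorry

-- Premised version (toy example): supporting subtheorems are stated first.
-- Proving any of them counts as partial progress toward the target.
theorem zero_add_premise (n : Nat) : 0 + n = n := by
  sorry

theorem succ_add_premise (m n : Nat) :
    Nat.succ m + n = Nat.succ (m + n) := by
  sorry

-- The target theorem can then build on the premises above.
theorem add_comm_premised (a b : Nat) : a + b = b + a := by
  sorry
```

Under this reading, a model that proves only `zero_add_premise` would register partial progress in the premised version but nothing in the plain one.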

In practice

Use premised benchmarks when evaluating LLM provers.
Focus LLM development on compact proof plans.

Topics

LLM Evaluation
Theorem Proving
Formal Mathematics
Lean4
Benchmark Design
Proof Automation

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.