Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization
Summary
A new evaluation protocol targets a step that theorem-proving benchmarks often overlook: generating the formal statement itself. A Lean statement can compile while omitting hypotheses or shifting the meaning of the original claim, so compilation checks alone miss these failures. The study introduces a 400-entry graduate-level benchmark spanning real analysis, complex analysis, topology, and algebra. Its protocol combines Lean compilation, cross-model semantic judging, and human expert calibration, and it reveals a 29.0-point gap between compilation success (89.5%) and consensus faithfulness (60.5%). Human audits indicate that the consensus metric is conservative: 96.0% of consensus-positive outputs are human-confirmed faithful, while 82.4% of compile-pass, consensus-negative outputs are human-confirmed semantic failures. A $2^3$ factorial design then decomposes the interventions: elaboration feedback is the largest validity intervention, search improves grounding, and fine-tuned drafting is largely substitutable.
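The headline gap can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the summary. The per-entry counts are derived under two assumptions that the summary does not state explicitly: both rates are measured over all 400 entries, and every consensus-faithful output also compiles.

```python
# Figures quoted in the summary.
BENCHMARK_SIZE = 400
COMPILE_RATE = 89.5    # percent of outputs that pass Lean compilation
FAITHFUL_RATE = 60.5   # percent judged faithful by consensus

# Gap in percentage points between the two rates.
gap_points = round(COMPILE_RATE - FAITHFUL_RATE, 1)

# Derived counts. ASSUMPTION: both rates use all 400 entries as denominator.
compiled = round(BENCHMARK_SIZE * COMPILE_RATE / 100)
faithful = round(BENCHMARK_SIZE * FAITHFUL_RATE / 100)

# ASSUMPTION: consensus-faithful outputs are a subset of compiling outputs.
compiled_not_faithful = compiled - faithful

print(f"gap: {gap_points} points")
print(f"compiled: {compiled}, consensus-faithful: {faithful}")
print(f"compile-pass but not consensus-faithful: {compiled_not_faithful}")
```

Under those assumptions, roughly one in three compiling outputs would count as a success by compilation rate alone without being judged faithful.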
Key takeaway
For research scientists building natural-language-to-Lean formalization systems, compilation rate alone is an insufficient and misleading evaluation metric. Integrate semantic judging and human expert calibration into your evaluation protocol so that it measures faithfulness rather than formal validity alone. Doing so exposes the gap between a statement that compiles and a statement that means what the original text says, and it shows which system improvements actually matter.
Key insights
Faithful formalization requires more than compilation; semantic evaluation is crucial for reliable theorem proving.
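To make the failure mode concrete, here is a small illustrative pair of Lean 4 statements. It is not drawn from the benchmark; the theorem names are hypothetical and the sketch assumes Mathlib is available. Both statements compile (with `sorry` standing in for the proof), but only the first formalizes "a function continuous on a closed interval is bounded there."

```lean
import Mathlib

-- Faithful: continuity on the closed interval appears as a hypothesis.
theorem bounded_on_interval_faithful (f : ℝ → ℝ) (a b : ℝ)
    (hf : ContinuousOn f (Set.Icc a b)) :
    ∃ M : ℝ, ∀ x ∈ Set.Icc a b, |f x| ≤ M := by
  sorry

-- Compiles, but unfaithful: the continuity hypothesis has been dropped,
-- so this now claims that every real function is bounded on [a, b].
theorem bounded_on_interval_unfaithful (f : ℝ → ℝ) (a b : ℝ) :
    ∃ M : ℝ, ∀ x ∈ Set.Icc a b, |f x| ≤ M := by
  sorry
```

A compilation check accepts both statements equally; only a semantic comparison against the natural-language source separates them.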
Principles
- Compilation alone is an insufficient validity check for formalization.
- Semantic judging and human calibration are vital for faithfulness metrics.
- Formal validity, proof competence, and faithful generation should be reported separately.
Method
A protocol combining Lean compilation, cross-model semantic judging, and human expert calibration on a 400-entry benchmark.
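One way to picture how compilation and cross-model judging might combine is sketched below. The `Candidate` record, the unanimous-vote consensus rule, and the toy data are all assumptions made for illustration; the summary does not specify how the paper aggregates judge decisions, and human calibration is left out of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    entry_id: str
    compiles: bool            # outcome of the Lean compilation check
    judge_votes: list[bool]   # one "faithful?" vote per judge model


def consensus_faithful(c: Candidate) -> bool:
    """ASSUMED rule: faithful only if it compiles and all judges agree."""
    return c.compiles and len(c.judge_votes) > 0 and all(c.judge_votes)


def report(candidates: list[Candidate]) -> dict[str, float]:
    """Report compilation and faithfulness as separate rates (percent)."""
    n = len(candidates)
    compile_rate = 100 * sum(c.compiles for c in candidates) / n
    faithful_rate = 100 * sum(consensus_faithful(c) for c in candidates) / n
    return {
        "compile_rate": compile_rate,
        "faithful_rate": faithful_rate,
        "gap": compile_rate - faithful_rate,
    }


# Toy data for illustration only, not benchmark results.
toy = [
    Candidate("a", True, [True, True]),
    Candidate("b", True, [True, False]),    # compiles, judges disagree
    Candidate("c", False, [False, False]),  # fails compilation
    Candidate("d", True, [True, True]),
]
result = report(toy)
print(result)
```

Keeping the two rates as separate fields, rather than collapsing them into one score, mirrors the principle of reporting formal validity and faithful generation separately.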
In practice
- Implement semantic judging alongside compilation for formalization tasks.
- Decompose formalization pipelines to identify intervention impacts.
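Decomposing a pipeline with a $2^3$ factorial design means running all eight on/off combinations of three interventions and comparing averages. The sketch below computes main effects in the standard way (mean score with the factor on, minus mean score with it off). The factor names follow the summary, but the scores are synthetic placeholders chosen so the arithmetic is easy to check; they are not results from the paper.

```python
from itertools import product

# Factor order used in each configuration tuple (names follow the summary).
FACTORS = ("elaboration_feedback", "search", "finetuned_drafting")


def main_effects(scores: dict[tuple[bool, bool, bool], float]) -> dict[str, float]:
    """Main effect of each factor: mean(on) - mean(off) over all 8 runs."""
    effects = {}
    for i, name in enumerate(FACTORS):
        on = [v for cfg, v in scores.items() if cfg[i]]
        off = [v for cfg, v in scores.items() if not cfg[i]]
        effects[name] = sum(on) / len(on) - sum(off) / len(off)
    return effects


# SYNTHETIC placeholder scores for all 2**3 = 8 configurations.
toy_scores = {
    cfg: 50.0 + 20.0 * cfg[0] + 5.0 * cfg[1] + 1.0 * cfg[2]
    for cfg in product((False, True), repeat=3)
}

effects = main_effects(toy_scores)
for name, value in effects.items():
    print(f"{name}: {value:+.1f}")
```

Because every combination is run, each main effect averages over the settings of the other two factors, which is what lets the design attribute a change to one intervention at a time.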
Topics
- Natural Language Formalization
- Lean Theorem Prover
- Formal Verification
- Semantic Evaluation
- Proof Search Benchmarks
- Formalization Pipelines
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.