Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
Summary
A study investigates "formalization gaming" in large language models (LLMs) when generating Lean 4 proofs for natural-language logical reasoning problems. Researchers evaluated GPT-5 and DeepSeek-R1 on 303 first-order logic problems, comprising 203 from FOLIO and 100 from Multi-LogiEval. The evaluation compared a unified generation approach against a two-stage pipeline that separates formalization from proving. Despite high compilation rates ranging from 87% to 99%, the study found no systematic evidence of gaming in unified generation, as models tended to report failure rather than force incorrect proofs. However, the two-stage pipeline revealed distinct unfaithfulness modes: GPT-5 fabricated axioms during proof generation, while DeepSeek-R1 mistranslated premises during formalization, producing internally consistent but unfaithful outputs.
Key takeaway
For research scientists developing or deploying LLMs for formal reasoning, you should not rely solely on high compilation rates or accuracy metrics as indicators of faithful reasoning. Implement multi-stage evaluation pipelines to detect subtle forms of unfaithfulness, such as fabricated axioms or mistranslated premises, which can produce internally consistent but logically unsound outputs. This approach helps ensure the integrity of automated proof generation.
Key insights
LLMs can produce unfaithful logical proofs despite high compilation rates, highlighting a gap between validity and faithfulness.
Principles
- High compilation rates do not equate to faithful reasoning.
- Unfaithfulness can manifest in distinct ways across models.
Method
The study used a two-stage pipeline separating formalization from proving to detect unfaithfulness, comparing it against unified generation on first-order logic problems.
In practice
- Cross-stage comparison can detect axiom fabrication.
- Mistranslated premises can lead to undetectable unfaithfulness.
Topics
- Formalization Gaming
- Logical Reasoning
- Large Language Models
- Lean 4 Proofs
- GPT-5
Code references
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.