Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study investigates "formalization gaming" in large language models (LLMs) that generate Lean 4 proofs for natural-language logical reasoning problems. The researchers evaluated GPT-5 and DeepSeek-R1 on 303 first-order logic problems: 203 from FOLIO and 100 from Multi-LogiEval. The evaluation compared unified generation against a two-stage pipeline that separates formalization from proving. Although compilation rates were high (87% to 99%), the study found no systematic evidence of gaming in unified generation: models tended to report failure rather than force incorrect proofs. The two-stage pipeline, however, revealed distinct unfaithfulness modes. GPT-5 fabricated axioms during proof generation, while DeepSeek-R1 mistranslated premises during formalization, producing internally consistent but unfaithful outputs.
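
To make the two unfaithfulness modes concrete, here is a minimal Lean 4 sketch. The toy problem and every name in it (Entity, Bird, Penguin, Flies, tweety, p1, p2, fabricated, p1_mistranslated) are hypothetical and chosen for illustration; they are not drawn from FOLIO, Multi-LogiEval, or the models' actual outputs. Both theorems should compile, yet neither conclusion follows from the problem as stated.

```lean
-- Hypothetical toy problem (illustration only):
--   Premise 1: Every bird that is not a penguin can fly.
--   Premise 2: Tweety is a bird.
--   Question:  Can Tweety fly?  (Not entailed: Tweety might be a penguin.)

axiom Entity : Type
axiom Bird : Entity → Prop
axiom Penguin : Entity → Prop
axiom Flies : Entity → Prop
axiom tweety : Entity

-- Faithful formalization of the two premises.
axiom p1 : ∀ x, Bird x ∧ ¬ Penguin x → Flies x
axiom p2 : Bird tweety

-- Mode A, fabricated axiom: the proof stage adds a fact the problem never states.
axiom fabricated : ¬ Penguin tweety

theorem flies_via_fabrication : Flies tweety :=
  p1 tweety ⟨p2, fabricated⟩

-- Mode B, mistranslated premise: the formalization stage drops the
-- "not a penguin" condition, so Premise 1 becomes stronger than the text.
axiom p1_mistranslated : ∀ x, Bird x → Flies x

theorem flies_via_mistranslation : Flies tweety :=
  p1_mistranslated tweety p2
```

In both cases the Lean kernel accepts the proof because it follows from the declared axioms. The error lies in the axioms themselves, which is why a successful compile says nothing about faithfulness to the natural-language problem.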

Key takeaway

If you develop or deploy LLMs for formal reasoning, do not rely solely on high compilation rates or accuracy metrics as indicators of faithful reasoning. Implement multi-stage evaluation pipelines to detect subtle forms of unfaithfulness, such as fabricated axioms or mistranslated premises, which can produce outputs that are internally consistent but unfaithful to the original problem. This approach helps ensure the integrity of automated proof generation.

Key insights

LLMs can produce unfaithful logical proofs despite high compilation rates. Compilation confirms that a proof is valid for the formalization it was given, not that the formalization is faithful to the original problem.

Principles

Method

The study used a two-stage pipeline separating formalization from proving to detect unfaithfulness, comparing it against unified generation on first-order logic problems.
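
One check that a two-stage setup makes possible can be sketched in Python: compare the axioms declared in the stage-two proof with those declared in the stage-one formalization, and flag any new ones. This is an illustrative assumption, not the study's detection procedure. The function names, the regular expression, and the sample Lean strings are all hypothetical.

```python
import re

# Matches Lean 4 declarations of the form `axiom name : type` at the start of
# a line. Purely textual: commented-out axioms starting with `--` are skipped.
AXIOM_DECL = re.compile(r"^\s*axiom\s+([^\s:({\[]+)", re.MULTILINE)


def declared_axioms(lean_source: str) -> set[str]:
    """Return the names of all axioms declared in a Lean 4 source string."""
    return set(AXIOM_DECL.findall(lean_source))


def fabricated_axioms(formalization: str, proof: str) -> set[str]:
    """Return axioms declared in the stage-two proof but absent from stage one."""
    return declared_axioms(proof) - declared_axioms(formalization)


# Hypothetical stage-one output: the formalization of the premises.
stage_one = """
axiom Entity : Type
axiom Bird : Entity → Prop
axiom tweety : Entity
axiom p2 : Bird tweety
"""

# Hypothetical stage-two output: the same formalization plus a proof attempt
# that introduces an extra axiom.
stage_two = stage_one + """
axiom fabricated : ¬ Penguin tweety
theorem goal : Flies tweety := sorry
"""

print(fabricated_axioms(stage_one, stage_two))  # → {'fabricated'}
```

A textual diff like this can only surface axioms added after formalization. It cannot catch a mistranslated premise, because that error is already present in stage one and requires comparing the formalization against the natural-language text.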

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.