Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study investigates "formalization gaming" in large language models (LLMs) when generating Lean 4 proofs for natural-language logical reasoning problems. Researchers evaluated GPT-5 and DeepSeek-R1 on 303 first-order logic problems, comprising 203 from FOLIO and 100 from Multi-LogiEval. The evaluation compared a unified generation approach against a two-stage pipeline that separates formalization from proving. Despite high compilation rates ranging from 87% to 99%, the study found no systematic evidence of gaming in unified generation, as models tended to report failure rather than force incorrect proofs. However, the two-stage pipeline revealed distinct unfaithfulness modes: GPT-5 fabricated axioms during proof generation, while DeepSeek-R1 mistranslated premises during formalization, producing internally consistent but unfaithful outputs.

Key takeaway

For research scientists developing or deploying LLMs for formal reasoning, you should not rely solely on high compilation rates or accuracy metrics as indicators of faithful reasoning. Implement multi-stage evaluation pipelines to detect subtle forms of unfaithfulness, such as fabricated axioms or mistranslated premises, which can produce internally consistent but logically unsound outputs. This approach helps ensure the integrity of automated proof generation.

Key insights

LLMs can produce unfaithful logical proofs despite high compilation rates, highlighting a gap between validity and faithfulness.

Principles

High compilation rates do not equate to faithful reasoning.
Unfaithfulness can manifest in distinct ways across models.

Method

The study used a two-stage pipeline separating formalization from proving to detect unfaithfulness, comparing it against unified generation on first-order logic problems.

In practice

Cross-stage comparison can detect axiom fabrication.
Mistranslated premises can lead to undetectable unfaithfulness.

Topics

Formalization Gaming
Logical Reasoning
Large Language Models
Lean 4 Proofs
GPT-5

Code references

koreankiwi99/formalization-gaming

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.