Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning
Summary
The study examines how faithfully Large Language Models (LLMs) reason in legal entailment tasks. It compares three setups on a re-annotated ContractNLI subset: pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver. The authors find a measurable gap between pragmatic legal interpretation and strict formal entailment: many legally sound inferences have no formal grounding unless additional assumptions are supplied. Introducing formal structure improved accuracy, and LLM-based Formal Reasoning achieved the highest benchmark performance, but this gain did not imply faithful reasoning. Three failure modes recur: scope laundering, where LLMs report solver-inconsistent classifications; implicit constraint blindness, where LLMs overlook logical constraints; and program synthesis failures, where LLMs generate incorrect Z3 code. Scope laundering persisted across all models, which raises concerns about using LLM-based formal reasoning as a proxy for symbolic execution and exposes a gap between benchmark accuracy and logical faithfulness.
Key takeaway
NLP engineers building legal AI systems should not judge LLM legal reasoning by benchmark accuracy alone. A system can exhibit "scope laundering," producing conclusions that look logical but are unfounded. Add verification steps, such as a symbolic solver or human-in-the-loop review, to check logical faithfulness and reduce the risk of unfaithful reasoning in critical legal applications.
Key insights
LLMs achieve high accuracy in legal reasoning benchmarks but often lack true logical faithfulness, exhibiting critical failure modes.
Principles
- Legal interpretation often requires unstated assumptions.
- Benchmark accuracy does not equate to logical faithfulness.
- Formal structure can improve LLM performance.
Method
The study compared pure LLM classification, LLM-based Formal Reasoning, and Z3 SMT solver-based Formal Reasoning on a re-annotated ContractNLI subset to assess faithfulness.
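To make the notion of strict formal entailment concrete, here is a minimal sketch in Python. It is not the study's pipeline: a brute-force truth-table check stands in for the Z3 SMT solver, and the clause, hypothesis, variable names, and label strings are all hypothetical, chosen only for illustration. The sketch shows how a hypothesis that reads as legally sound can come out as "neutral" until an unstated assumption is added as an explicit premise.

```python
from itertools import product
from typing import Callable, Dict, List

Assignment = Dict[str, bool]
Formula = Callable[[Assignment], bool]


def satisfiable(variables: List[str], formulas: List[Formula]) -> bool:
    """Return True if some truth assignment satisfies every formula.

    Brute-force stand-in for a solver call (an assumption of this sketch).
    """
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(f(assignment) for f in formulas):
            return True
    return False


def classify(variables: List[str], premises: List[Formula],
             hypothesis: Formula) -> str:
    """Three-way label derived from satisfiability checks."""
    if not satisfiable(variables, premises):
        raise ValueError("premises are inconsistent; any label would be vacuous")
    # Entailment: the premises together with NOT hypothesis have no model.
    if not satisfiable(variables, premises + [lambda a: not hypothesis(a)]):
        return "entailment"
    # Contradiction: the premises together with the hypothesis have no model.
    if not satisfiable(variables, premises + [hypothesis]):
        return "contradiction"
    return "neutral"


# Hypothetical propositional encoding of a confidentiality clause.
VARS = ["discloses_to_third_party", "discloses_to_affiliate"]

# Clause: the receiving party does not disclose to third parties.
clause: Formula = lambda a: not a["discloses_to_third_party"]

# Hypothesis: the receiving party does not disclose to affiliates.
hypothesis: Formula = lambda a: not a["discloses_to_affiliate"]

# Unstated assumption: an affiliate counts as a third party.
assumption: Formula = lambda a: (
    (not a["discloses_to_affiliate"]) or a["discloses_to_third_party"]
)

strict_label = classify(VARS, [clause], hypothesis)
pragmatic_label = classify(VARS, [clause, assumption], hypothesis)
print(strict_label, pragmatic_label)
```

The two labels differ only because of the extra premise, which is the kind of gap between pragmatic interpretation and strict entailment the summary describes.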
In practice
- Re-annotate legal datasets for formal entailment.
- Scrutinize LLM "logical" outputs for scope laundering.
- Validate LLM-generated formal code (e.g., Z3).
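One way to act on the checks above is to log, for each example, the label the LLM reports alongside the label obtained by actually executing its formalization, then measure how often the two disagree. The sketch below is a rough illustration under assumptions: the `Record` fields and function names are hypothetical, the records are made-up data rather than results from the study, and "solver-inconsistent" is read here simply as "reported label differs from the executed solver label".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    gold: str      # benchmark label for the example
    reported: str  # label the LLM states as its final answer
    solver: str    # label obtained by executing the formalization


def accuracy(records: List[Record]) -> float:
    """Fraction of examples where the reported label matches the gold label."""
    if not records:
        return 0.0
    return sum(r.reported == r.gold for r in records) / len(records)


def inconsistency_rate(records: List[Record]) -> float:
    """Fraction of examples where the reported label differs from the solver label."""
    if not records:
        return 0.0
    return sum(r.reported != r.solver for r in records) / len(records)


def flag_inconsistent(records: List[Record]) -> List[int]:
    """Indices of examples that need manual review."""
    return [i for i, r in enumerate(records) if r.reported != r.solver]


# Illustrative records only; not data from the study.
records = [
    Record(gold="entailment", reported="entailment", solver="neutral"),
    Record(gold="neutral", reported="neutral", solver="neutral"),
    Record(gold="contradiction", reported="contradiction", solver="contradiction"),
    Record(gold="entailment", reported="entailment", solver="neutral"),
]

print("accuracy:", accuracy(records))
print("inconsistency rate:", inconsistency_rate(records))
print("flagged for review:", flag_inconsistent(records))
```

In this toy data every reported label matches the gold label, yet half of the answers are not backed by the executed formalization, so tracking the two numbers separately keeps a high benchmark score from hiding unfaithful reasoning.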
Topics
- Large Language Models
- Legal Reasoning
- Formal Reasoning
- SMT Solvers
- Model Faithfulness
- ContractNLI
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.