Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

A study investigated the faithfulness of Large Language Models (LLMs) in legal entailment tasks, comparing pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver on a re-annotated ContractNLI subset. The research found a measurable gap between pragmatic legal interpretation and strict formal entailment, where many legally sound inferences lack formal grounding without additional assumptions. Although introducing formal structure improved accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, this gain did not imply faithful reasoning. The study identified three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications; implicit constraint blindness, where LLMs overlook logical constraints; and program synthesis failures, where LLMs generate incorrect Z3 code. Critically, scope laundering persisted across all models, raising concerns about LLM-based formal reasoning as a proxy for symbolic execution and revealing a gap between benchmark accuracy and logical faithfulness.

Key takeaway

For NLP engineers building legal AI systems, benchmark accuracy alone is not sufficient evidence of sound legal reasoning. A system may exhibit "scope laundering," reporting classifications that are inconsistent with what the solver actually derives, so a correct label does not guarantee a faithful derivation. Add verification steps, such as integrating symbolic solvers or human-in-the-loop validation, to check logical faithfulness and reduce the risk of unfaithful reasoning in critical legal applications.
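One way to picture such a verification step is a small audit that compares the label an LLM reports with the label obtained by actually running a solver on its formalization. The sketch below is illustrative only: the `Prediction` record, the `audit` function, the label names, and the four toy records are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    example_id: str
    gold_label: str    # reference annotation
    llm_label: str     # label the LLM reports after its "formal" reasoning
    solver_label: str  # label obtained by executing the formalization in a solver


def audit(predictions):
    """Report accuracy alongside the share of solver-inconsistent answers."""
    if not predictions:
        raise ValueError("audit() needs at least one prediction")
    n = len(predictions)
    correct = sum(p.llm_label == p.gold_label for p in predictions)
    # An answer is flagged when the reported label differs from the solver's.
    flagged = [p.example_id for p in predictions if p.llm_label != p.solver_label]
    return {
        "accuracy": correct / n,
        "inconsistency_rate": len(flagged) / n,
        "solver_inconsistent": flagged,
    }


# Toy records, invented purely for illustration.
toy = [
    Prediction("c1", "Entailment", "Entailment", "Entailment"),
    Prediction("c2", "Entailment", "Entailment", "NotMentioned"),  # right label, unsupported
    Prediction("c3", "Contradiction", "Contradiction", "Contradiction"),
    Prediction("c4", "NotMentioned", "Entailment", "NotMentioned"),
]
report = audit(toy)
print(report)
```

In this toy data, record `c2` counts as correct for accuracy yet is still flagged, which is the kind of case that accuracy alone would hide.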

Key insights

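LLMs can score well on legal reasoning benchmarks without reasoning faithfully: the study documents scope laundering, implicit constraint blindness, and program synthesis failures, with scope laundering appearing in every model tested.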

Method

The study compared pure LLM classification, LLM-based Formal Reasoning, and Z3 SMT solver-based Formal Reasoning on a re-annotated ContractNLI subset to assess faithfulness.
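To make the idea of strict formal entailment concrete, here is a minimal sketch of a three-way entailment check. It is not the study's pipeline: a brute-force truth-table search stands in for the Z3 SMT solver, and the clause, the hypothesis, the bridging assumption, the variable names, and the label names are all hypothetical, chosen only to show how an inference that reads as legally sound can fail formally until an extra assumption is stated.

```python
from itertools import product


def implies(antecedent, consequent):
    """Build a formula 'antecedent -> consequent' over a truth assignment."""
    return lambda env: (not env[antecedent]) or env[consequent]


def entails(premises, hypothesis, variables):
    """True if every assignment satisfying all premises satisfies the hypothesis.

    Note: inconsistent premises entail everything (vacuously true).
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not hypothesis(env):
            return False  # counterexample found
    return True


def classify(premises, hypothesis, variables):
    """Map two entailment checks to a three-way label (names are illustrative)."""
    if entails(premises, hypothesis, variables):
        return "Entailment"
    if entails(premises, lambda env: not hypothesis(env), variables):
        return "Contradiction"
    return "NotMentioned"


VARIABLES = ["to_third_party", "to_affiliate", "breach"]

# Hypothetical clause: disclosure to a third party is a breach.
clause = implies("to_third_party", "breach")
# Hypothetical hypothesis: disclosure to an affiliate is a breach.
hypothesis = implies("to_affiliate", "breach")
# Bridging assumption a reader might supply tacitly: an affiliate is a third party.
bridge = implies("to_affiliate", "to_third_party")

strict_label = classify([clause], hypothesis, VARIABLES)
bridged_label = classify([clause, bridge], hypothesis, VARIABLES)
print(strict_label, bridged_label)  # NotMentioned Entailment
```

Without the bridging assumption the hypothesis is neither entailed nor contradicted; once the assumption is written down as a premise, the entailment goes through. Enumeration only works for a handful of Boolean variables, which is why a real system would hand the formulas to an SMT solver instead.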

Best for: Research Scientist, AI Scientist, NLP Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.