Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new framework, "structural uncertainty," quantifies how consistently a large language model (LLM) reasons, targeting unstable or contradictory reasoning paths in multi-step deductive tasks. Where existing methods measure the dispersion of outputs, this approach measures whether the model can consistently rank its own competing reasoning candidates. The framework generates multiple solutions, asks the LLM for pairwise preferences among them, and aggregates those preferences into ranking distributions using Bradley-Terry modeling with PageRank. It then decomposes the signal into across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, the structural signals complement answer dispersion and improve identification of unreliable instances in logical and mathematical reasoning. On factual retrieval the signal collapses, which marks a regime boundary. Within-trial ambiguity correlates positively with correctness, while across-trial instability correlates negatively and so signals unreliable reasoning. Published on 2026-06-15, the work presents the signal as a regime-sensitive evaluator, not a universal confidence estimator.

Key takeaway

AI scientists and ML engineers evaluating LLM reliability on complex reasoning tasks should track "structural uncertainty" metrics alongside traditional output dispersion. The combination helps identify unreliable reasoning instances, particularly where multiple plausible paths remain competitive, which improves diagnosis beyond simple correctness and informs model refinement.
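As a rough illustration of pairing the two kinds of signal, the sketch below computes answer dispersion as the Shannon entropy of sampled final answers and combines it with an across-trial instability score supplied by the caller. The function names, the answer normalization, and both thresholds are hypothetical choices made for this example and are not taken from the original article.

```python
from collections import Counter
import math


def answer_dispersion(answers):
    """Shannon entropy (in nats) of the empirical distribution of final answers.

    Assumption: answers are compared after trimming whitespace and lowercasing.
    """
    counts = Counter(a.strip().lower() for a in answers)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return max(0.0, h)  # avoid returning -0.0 when all answers agree


def flag_unreliable(dispersion, across_trial, disp_thresh=0.5, across_thresh=0.3):
    """Flag an instance when either signal exceeds its threshold.

    Both thresholds are illustrative placeholders; in practice they would be
    tuned on held-out data for the task at hand.
    """
    return dispersion > disp_thresh or across_trial > across_thresh


if __name__ == "__main__":
    samples = ["42", "42", " 42 ", "41"]
    d = answer_dispersion(samples)
    print("dispersion:", round(d, 3))
    print("flagged:", flag_unreliable(d, across_trial=0.4))
```

Only across-trial instability is used as a warning signal here, since the summary reports that within-trial ambiguity correlates positively with correctness.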

Key insights

Structural uncertainty quantifies LLM logical reasoning consistency by measuring the stability of self-preference-induced rankings over sampled solutions.

Method

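Generate multiple LLM solutions, have the LLM judge pairwise preferences among its own outputs, aggregate the preferences into ranking distributions via Bradley-Terry modeling with PageRank, then decompose the result into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity.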
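The pipeline described here can be sketched as follows, under several assumptions that this summary does not confirm. Each trial is assumed to yield a win-count matrix over the same fixed set of candidates; Bradley-Terry strengths are fitted with the standard minorization-maximization update and normalized into a distribution; and the two components are computed with the common "total entropy = mean entropy + remainder" split. The PageRank step is omitted because the summary does not say how it is combined with Bradley-Terry, and all function names and the smoothing constant are hypothetical.

```python
import numpy as np


def bradley_terry(wins, iters=200, tol=1e-9, smoothing=0.1):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i, j] = number of times candidate i was preferred over candidate j.
    A small pseudo-count (an assumption of this sketch) keeps the fit finite
    when one candidate never wins. Returns strengths normalized to sum to 1.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    wins = wins + smoothing * (1.0 - np.eye(n))
    games = wins + wins.T  # comparisons per pair; diagonal stays 0
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = wins.sum(axis=1) / denom
        new_p /= new_p.sum()
        converged = np.abs(new_p - p).max() < tol
        p = new_p
        if converged:
            break
    return p


def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())


def decompose(win_matrices):
    """Split ranking uncertainty into within-trial and across-trial parts.

    within_trial: mean entropy of each trial's ranking distribution.
    across_trial: entropy of the averaged distribution minus within_trial,
                  which is non-negative because entropy is concave.
    """
    dists = np.array([bradley_terry(w) for w in win_matrices])
    within = float(np.mean([entropy(d) for d in dists]))
    total = entropy(dists.mean(axis=0))
    across = max(0.0, total - within)
    return {"within_trial": within, "across_trial": across, "total": total}


if __name__ == "__main__":
    favors_first = [[0, 3, 3], [0, 0, 3], [0, 0, 0]]
    favors_last = [[0, 0, 0], [3, 0, 0], [3, 3, 0]]
    print("stable:  ", decompose([favors_first] * 3))
    print("unstable:", decompose([favors_first, favors_last]))
```

In this toy setup, repeating the same preferences across trials gives an across-trial component of essentially zero, while trials that disagree about the best candidate push it up. The split is one conventional choice and may differ from the definitions used in the original article.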

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.