Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
Summary
A new framework, "structural uncertainty," quantifies the consistency of large language model (LLM) logical reasoning, targeting unstable or contradictory reasoning paths, especially in multi-step deductive tasks. Unlike existing methods, which focus on output dispersion, it measures how consistently a model ranks its own competing reasoning candidates. The framework generates multiple solutions, asks the LLM for pairwise preferences among them, and aggregates these into ranking distributions using Bradley-Terry modeling with PageRank. It then decomposes the signal into across-trial ranking instability and within-trial candidate ambiguity. Across five LLMs and eight benchmarks, the structural signals complement answer dispersion and improve identification of unreliable instances in logical and mathematical reasoning. On factual retrieval the signal collapses, indicating a regime boundary. Within-trial ambiguity correlates positively with correctness, while across-trial instability correlates negatively and so signals unreliable reasoning. The framework is therefore a regime-sensitive evaluator, not a universal confidence estimator. The original article was published on 2026-06-15.
Key takeaway
AI scientists and ML engineers evaluating LLM reliability on complex reasoning tasks should track "structural uncertainty" metrics alongside traditional output dispersion to get a finer-grained picture of model consistency. The approach helps identify unreliable reasoning instances, particularly where multiple plausible paths remain competitive, which improves diagnosis beyond simple correctness and can inform model refinement.
Key insights
Structural uncertainty quantifies LLM logical reasoning consistency by measuring the stability of self-preference-induced rankings over sampled solutions.
Principles
- LLM reasoning paths can be unstable even when the final answer is correct.
- Ranking stability of self-generated solutions reveals consistency.
- Different consistency components correlate distinctly with accuracy.
Method
Generate multiple LLM solutions, have the LLM judge pairwise preferences among its outputs, aggregate preferences via Bradley-Terry modeling with PageRank, then decompose into two entropy-based components.
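The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: the summary does not say how Bradley-Terry and PageRank are combined, so the sketch uses only a PageRank-style power iteration over the preference graph, and it assumes the two components come from a standard entropy decomposition (mean per-trial entropy for within-trial ambiguity, generalized Jensen-Shannon divergence for across-trial instability). The names `pagerank_scores`, `entropy`, and `decompose` are hypothetical.

```python
import math

def pagerank_scores(wins, damping=0.85, iters=200):
    """Turn pairwise preference counts into a ranking distribution.

    wins[i][j] = number of times candidate i was preferred over candidate j.
    Each candidate passes its score to the candidates that beat it
    (a PageRank-style stand-in; not a Bradley-Terry maximum-likelihood fit).
    """
    n = len(wins)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for j in range(n):
            losses = sum(wins[i][j] for i in range(n) if i != j)
            for i in range(n):
                if i == j:
                    continue
                # An undefeated candidate spreads its mass uniformly.
                share = wins[i][j] / losses if losses > 0 else 1.0 / (n - 1)
                new[i] += damping * scores[j] * share
        scores = new
    return scores

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose(trial_distributions):
    """Split total ranking entropy into (within_trial, across_trial).

    Assumed decomposition: H(mean of trials) = mean H(trial) + JS divergence.
    """
    k = len(trial_distributions)
    n = len(trial_distributions[0])
    mean = [sum(d[i] for d in trial_distributions) / k for i in range(n)]
    within = sum(entropy(d) for d in trial_distributions) / k
    across = max(0.0, entropy(mean) - within)  # clamp floating-point noise
    return within, across

# Two judging trials over the same three candidate solutions (made-up counts).
trial_a = [[0, 3, 3], [0, 0, 2], [0, 1, 0]]  # candidate 0 wins every comparison
trial_b = [[0, 1, 0], [2, 0, 0], [3, 3, 0]]  # candidate 2 wins every comparison
dist_a = pagerank_scores(trial_a)
dist_b = pagerank_scores(trial_b)
within, across = decompose([dist_a, dist_b])
print(f"within-trial ambiguity: {within:.3f}, across-trial instability: {across:.3f}")
```

In this toy input the two trials disagree about the best candidate, so the across-trial term is positive; feeding the same distribution twice drives it to zero while leaving the within-trial term unchanged.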
In practice
- Apply structural uncertainty to logical reasoning tasks.
- Use ranking instability to flag unreliable reasoning.
- Treat factual retrieval as a separate regime, where the structural signal collapses.
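One way to act on these points is a small triage helper that checks answer dispersion and ranking instability side by side. The sketch below is illustrative only: the function names, the task-type labels, and both thresholds are hypothetical and would need tuning per model and benchmark, and the instability value is assumed to be computed elsewhere.

```python
import math
from collections import Counter

def answer_dispersion(answers):
    """Normalized Shannon entropy of sampled final answers (0.0 = all agree)."""
    counts = Counter(answers)
    total = len(answers)
    if len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(total)  # 1.0 when every sample gives a different answer

def flag_unreliable(answers, across_trial_instability, task_type,
                    dispersion_threshold=0.5, instability_threshold=0.1):
    """Return the list of signals that mark this instance as unreliable.

    Thresholds are placeholder assumptions, not values from the article.
    The structural signal is skipped for factual retrieval, where the
    summary reports that it collapses.
    """
    reasons = []
    if answer_dispersion(answers) > dispersion_threshold:
        reasons.append("answer_dispersion")
    if (task_type != "factual_retrieval"
            and across_trial_instability > instability_threshold):
        reasons.append("ranking_instability")
    return reasons

# All samples agree on the answer, yet the self-ranking is unstable.
print(flag_unreliable(["42"] * 5, 0.3, "logical_reasoning"))
```

The example shows the complementary case the summary highlights: answer dispersion alone sees nothing wrong, while the ranking signal still raises a flag.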
Topics
- Large Language Models
- LLM Reasoning
- Structural Uncertainty
- Consistency Quantification
- Bradley-Terry Model
- Model Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.