Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning
Summary
The paper introduces Normalized Logit Difference Decay (NLDD), a metric that quantifies how much each Chain-of-Thought (CoT) reasoning step causally influences a Large Language Model's prediction. NLDD corrupts individual reasoning steps, measures the resulting drop in the model's confidence, and normalizes the result so that scores are comparable across models. The authors evaluate DeepSeek-Coder-6.7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-Instruct on syntactic (Dyck-$n$), logical (PrOntoQA), and arithmetic (GSM8K) tasks and identify a consistent "Reasoning Horizon" ($k^{*}$) at 70–85% of the chain length, beyond which reasoning tokens have negligible or negative impact. The study distinguishes "Faithful" regimes (Llama, DeepSeek), in which the CoT is causal, from an "Anti-Faithful" regime (Gemma). On PrOntoQA, Gemma reached 99.0% accuracy with an NLDD of -52.5%, meaning that corrupting the reasoning paradoxically increased the model's confidence. The authors also report a "Mapping Gap": models can internally encode task-relevant information (e.g., 82.0% probe accuracy for Gemma on Dyck-$n$) yet fail to use it for the final answer (0.0% task accuracy).
Key takeaway
If you are a Machine Learning Engineer deploying LLMs with Chain-of-Thought, use NLDD to assess whether the reasoning is actually faithful, not just whether the answer is accurate. Your models might be "anti-faithful," achieving high accuracy without causally depending on their CoT, or they might exhibit a "Mapping Gap," encoding task-relevant information internally without using it for the final answer. Use the Reasoning Horizon ($k^{*}$) to prune unnecessary reasoning steps, which may improve efficiency or prevent negative interference from extended CoT, especially in critical applications.
Key insights
NLDD quantifies CoT faithfulness and reveals that models can achieve high accuracy without causally relying on the reasoning they generate.
Principles
- CoT faithfulness varies significantly across LLM architectures.
- High accuracy does not guarantee causal reliance on CoT.
- Reasoning influence decays beyond a specific "horizon."
Method
NLDD measures how much the model's confidence degrades in logit space when individual CoT steps are corrupted, and it normalizes the result for cross-model comparison. The study complements NLDD with Representational Similarity Analysis (RSA) and a Trajectory Alignment Score (TAS) to probe internal states.
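To make the idea concrete, here is a minimal sketch of an NLDD-style score. This summary does not give the paper's exact formula, so the definitions below are assumptions: the logit difference is taken as the correct answer's logit minus the strongest competing logit, and the score is the relative drop in that margin after corruption, expressed as a percentage. The function names and the `eps` guard are illustrative, not taken from the paper.

```python
from typing import Sequence


def logit_difference(logits: Sequence[float], correct_idx: int) -> float:
    """Margin between the correct answer's logit and the strongest competitor.

    Assumption: this is one common way to express confidence in logit space;
    the paper may define the margin differently.
    """
    competitors = [v for i, v in enumerate(logits) if i != correct_idx]
    return logits[correct_idx] - max(competitors)


def nldd(
    clean_logits: Sequence[float],
    corrupted_logits: Sequence[float],
    correct_idx: int,
    eps: float = 1e-8,
) -> float:
    """Relative drop (in percent) of the logit margin after corrupting a step.

    Positive: corruption reduced confidence (the step mattered causally).
    Negative: corruption increased confidence ("anti-faithful" behaviour).
    Dividing by the clean margin is the assumed normalization that makes
    scores comparable across models with different logit scales.
    """
    ld_clean = logit_difference(clean_logits, correct_idx)
    ld_corrupt = logit_difference(corrupted_logits, correct_idx)
    return 100.0 * (ld_clean - ld_corrupt) / max(abs(ld_clean), eps)


# Toy answer logits over three candidate answers; index 0 is correct.
clean = [4.0, 1.0, 0.0]          # clean margin: 3.0
hurt_by_corruption = [2.5, 1.0, 0.0]    # margin falls to 1.5
helped_by_corruption = [5.5, 1.0, 0.0]  # margin rises to 4.5

print(nldd(clean, hurt_by_corruption, correct_idx=0))    # 50.0
print(nldd(clean, helped_by_corruption, correct_idx=0))  # -50.0
```

In a real evaluation the two logit vectors would come from running the model once on the intact chain and once with a single step corrupted; the toy lists above only stand in for those outputs.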
In practice
- Use NLDD to identify causally irrelevant CoT steps.
- Prune CoT chains beyond the detected Reasoning Horizon ($k^{*}$).
- Evaluate models for "anti-faithful" behavior, not just accuracy.
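The steps above can be sketched as follows, assuming a per-step NLDD score is already available for every step of a chain. How the paper actually locates $k^{*}$ is not stated in this summary; the rule used here (the last step whose score exceeds a small threshold) and the `threshold` value are hypothetical choices for illustration.

```python
from typing import List, Sequence


def reasoning_horizon(step_nldd: Sequence[float], threshold: float = 5.0) -> int:
    """Return k*, the 1-indexed position of the last step whose NLDD exceeds
    `threshold`. Steps after k* have negligible or negative influence.
    Returns 0 when no step clears the threshold.

    Assumption: the threshold rule is illustrative, not the paper's procedure.
    """
    horizon = 0
    for k, score in enumerate(step_nldd, start=1):
        if score > threshold:
            horizon = k
    return horizon


def prune_chain(
    steps: Sequence[str], step_nldd: Sequence[float], threshold: float = 5.0
) -> List[str]:
    """Keep only the reasoning steps up to and including the horizon."""
    return list(steps[: reasoning_horizon(step_nldd, threshold)])


def is_anti_faithful(step_nldd: Sequence[float]) -> bool:
    """Flag a chain whose mean NLDD is negative, i.e. corruption tends to
    raise confidence. Using the mean is an assumed aggregation."""
    return sum(step_nldd) / len(step_nldd) < 0.0


# Toy scores for a ten-step chain; influence fades after step 8.
scores = [40.0, 38.0, 30.0, 25.0, 21.0, 18.0, 12.0, 9.0, 2.0, -4.0]
steps = [f"step {i}" for i in range(1, 11)]

k_star = reasoning_horizon(scores)
print(k_star, k_star / len(scores))     # 8 0.8
print(prune_chain(steps, scores)[-1])   # step 8
print(is_anti_faithful(scores))         # False
```

A pruned chain should be re-evaluated on held-out examples before it replaces the full chain, since the horizon is estimated from scores rather than guaranteed.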
Topics
- Chain-of-Thought
- LLM Faithfulness
- NLDD Metric
- Reasoning Horizon
- Model Architectures
- Interpretability Methods
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.