Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

A new metric, Normalized Logit Difference Decay (NLDD), quantifies the causal influence of individual Chain-of-Thought (CoT) reasoning steps on Large Language Model predictions. The method corrupts one reasoning step at a time and measures the resulting drop in the model's confidence, normalizing the result so that scores are comparable across models. Evaluating DeepSeek-Coder-6.7B-Instruct, Llama-3.1-8B-Instruct, and Gemma-2-9B-Instruct on syntactic (Dyck-$n$), logical (PrOntoQA), and arithmetic (GSM8K) tasks, the researchers identified a consistent "Reasoning Horizon" ($k^{*}$) at 70–85% of the chain length, beyond which reasoning tokens have negligible or negative impact. The study distinguishes a "Faithful" regime (Llama, DeepSeek), in which the CoT is causal, from an "Anti-Faithful" regime (Gemma): Gemma reached 99.0% accuracy on PrOntoQA with an NLDD of -52.5%, meaning that corruption paradoxically increased its confidence. The study also observed a "Mapping Gap," in which a model internally encodes task-relevant information (e.g., 82.0% probe accuracy for Gemma on Dyck-$n$) but fails to use it for the final answer (0.0% task accuracy).

Key takeaway

Machine Learning Engineers deploying LLMs with Chain-of-Thought should use NLDD to assess reasoning faithfulness, not just accuracy. A model may be "anti-faithful," achieving high performance without causally depending on its CoT, or it may exhibit a "Mapping Gap." The identified Reasoning Horizon ($k^{*}$) can guide the pruning of unnecessary reasoning steps, potentially improving efficiency or preventing negative interference from extended CoT, especially in critical applications.
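The pruning idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's procedure: it assumes you already have one NLDD score per reasoning step, and it assumes the horizon can be read off as the last step whose score exceeds a small threshold. The function names `reasoning_horizon` and `prune_chain`, the threshold value, and the toy scores are all hypothetical.

```python
from typing import List, Sequence


def reasoning_horizon(step_scores: Sequence[float], threshold: float = 1.0) -> int:
    """Return the 1-based index of the last step whose score exceeds `threshold`.

    Assumption: steps after this index contribute negligibly (or negatively)
    to the final answer. Returns 0 if no step clears the threshold.
    """
    horizon = 0
    for index, score in enumerate(step_scores, start=1):
        if score > threshold:
            horizon = index
    return horizon


def prune_chain(steps: List[str], step_scores: Sequence[float],
                threshold: float = 1.0) -> List[str]:
    """Keep only the reasoning steps up to and including the estimated horizon."""
    if len(steps) != len(step_scores):
        raise ValueError("need exactly one score per reasoning step")
    return steps[:reasoning_horizon(step_scores, threshold)]


# Toy per-step scores in percent (made-up numbers, not results from the paper).
scores = [40.0, 35.0, 22.0, 12.0, 6.0, 3.0, 1.5, 0.4, -0.8, 0.1]
steps = [f"step {i}" for i in range(1, len(scores) + 1)]

k_star = reasoning_horizon(scores)
kept = prune_chain(steps, scores)
relative_horizon = k_star / len(scores)  # fraction of the chain that is kept
```

Any pruning of this kind should be checked against task accuracy on held-out data before deployment, since a threshold chosen on one task may not transfer to another.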

Key insights

NLDD quantifies CoT faithfulness, revealing that models can achieve high accuracy without causally relying on their generated reasoning.

Principles

Method

NLDD measures confidence degradation in logit space when individual CoT steps are corrupted, normalizing the result for cross-model comparison. The study complements NLDD with Representational Similarity Analysis (RSA) and Trajectory Alignment Score (TAS) to probe internal states.
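A rough sketch of a score of this kind, given answer-position logits from a clean run and from a run with one corrupted step, might look like the following. The exact formula is not stated in this summary, so the normalization here (dividing by the magnitude of the clean logit margin and reporting a percentage) is an assumption, as are the function names and the toy logits. Under this assumption a positive score means corruption reduced confidence, and a negative score means corruption increased it.

```python
import numpy as np


def logit_difference(logits: np.ndarray, correct_id: int) -> float:
    """Margin between the correct token's logit and the strongest competitor."""
    competitors = np.delete(logits, correct_id)
    return float(logits[correct_id] - competitors.max())


def nldd(clean_logits: np.ndarray, corrupt_logits: np.ndarray, correct_id: int) -> float:
    """Percentage drop in the logit margin after corrupting one CoT step.

    Assumed normalization: divide by the magnitude of the clean margin so that
    models with different logit scales can be compared.
    """
    ld_clean = logit_difference(clean_logits, correct_id)
    ld_corrupt = logit_difference(corrupt_logits, correct_id)
    if ld_clean == 0.0:
        raise ValueError("clean margin is zero; the normalized score is undefined")
    return 100.0 * (ld_clean - ld_corrupt) / abs(ld_clean)


# Toy logits over a 4-token vocabulary; token 1 is the correct answer.
clean = np.array([1.0, 5.0, 0.5, 2.0])    # clean margin: 5.0 - 2.0 = 3.0
hurt = np.array([1.0, 3.5, 0.5, 2.0])     # corrupted margin: 1.5
helped = np.array([1.0, 6.5, 0.5, 2.0])   # corrupted margin: 4.5

faithful_score = nldd(clean, hurt, correct_id=1)         # margin shrank
anti_faithful_score = nldd(clean, helped, correct_id=1)  # margin grew
```

In a real evaluation the two logit vectors would come from forward passes of the model on the original and corrupted chains; that step is omitted here to keep the example self-contained.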

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.