ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces
Summary
ReasoningFlow is a framework that captures the non-linear discourse structure of Large Reasoning Model (LRM) traces as fine-grained directed acyclic graphs (DAGs). Its schema, which defines 8 node types and 14 edge types, was developed through manual annotation of 31 traces (2.1k steps) with high inter-annotator agreement (Krippendorff's α > 0.8). It was then scaled to automatic annotation of 1,260 traces (247.7k steps) from five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B) across math, science, and argumentation tasks. The analysis yields four findings: (1) LRMs produce structurally similar traces despite differences in training; (2) the schema surfaces diverse fine-grained behaviors such as local verification and self-reflection; (3) most erroneous steps do not contribute to the final answer; and (4) mechanistic causal dependencies do not align with language-level discourse structures.
Key takeaway
For MLOps engineers and AI scientists who evaluate LRM performance or build robust LLM applications, stepwise error detection alone is insufficient. Add discourse structure analysis, such as ReasoningFlow, to see whether an error actually propagates to the answer or is corrected along the way. The analysis shows that most erroneous steps in LRM traces are unused or neglected, which enables more accurate faithfulness monitoring and targeted improvements to reasoning processes, especially local verification mechanisms.
Key insights
ReasoningFlow maps LRM reasoning traces into fine-grained DAGs, uncovering diverse behaviors and the actual impact of errors.
Principles
- LRMs exhibit structurally similar reasoning traces across models and training data.
- Self-reflection sentiment in LRMs correlates with the quality of reflected reasoning steps.
- Most erroneous steps in LRM traces do not contribute to the final answer.
Method
Define an annotation schema with 8 node types and 14 edge types, validate it through manual annotation, then scale it with an LLM-powered pipeline that performs node segmentation, node classification, and edge detection and classification.
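The three pipeline stages can be sketched as a small Python skeleton. This is an illustrative sketch only, not the authors' implementation: the `annotate` function, the data classes, the toy stand-ins for the LLM calls, and the labels `"reasoning"`, `"reflection"`, and `"follows"` are all hypothetical (the summary does not list the actual 8 node types or 14 edge types). It also assumes edges only point from earlier steps to later ones, which is one simple way to keep the graph acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    idx: int    # position of the step in the trace
    text: str   # the step's text
    label: str  # node type (hypothetical label, not the paper's schema)


@dataclass
class Edge:
    src: int    # index of the earlier step
    dst: int    # index of the later step
    label: str  # edge type (hypothetical label, not the paper's schema)


@dataclass
class TraceGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)


def annotate(trace, segment, classify_node, link):
    """Run segmentation, node classification, and edge detection/classification.

    `segment`, `classify_node`, and `link` are callables standing in for the
    LLM-powered stages; here they are injected so the sketch stays runnable.
    """
    steps = segment(trace)
    nodes = [Node(i, text, classify_node(text)) for i, text in enumerate(steps)]
    edges = []
    for node in nodes:
        # `link` returns (source index, edge label) pairs for the current node.
        for src, label in link(nodes[:node.idx], node):
            if not 0 <= src < node.idx:
                # Assumption: only earlier -> later edges, so the result is a DAG.
                raise ValueError(f"edge {src} -> {node.idx} is not backward-pointing")
            edges.append(Edge(src, node.idx, label))
    return TraceGraph(nodes, edges)


# Toy stand-ins for the LLM calls (purely illustrative heuristics).
def toy_segment(trace):
    return [s.strip() for s in trace.split(".") if s.strip()]


def toy_classify(step):
    return "reflection" if step.lower().startswith("wait") else "reasoning"


def toy_link(earlier, node):
    return [(earlier[-1].idx, "follows")] if earlier else []


graph = annotate(
    "Let x be 2. Then x + 1 is 3. Wait, check that again.",
    toy_segment,
    toy_classify,
    toy_link,
)
```

Passing the stages in as callables keeps the graph-building logic separate from whichever model or prompt performs each stage, so a stage can be swapped or evaluated on its own.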
In practice
- Monitor fine-grained LRM reasoning behaviors like local verification and assumptions.
- Improve stepwise evaluation by tracking error propagation through discourse structures.
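Tracking error propagation through a discourse graph can be approximated with a reachability check. The sketch below is a simplification under stated assumptions, not the paper's procedure: it treats every edge as a dependency from an earlier step to a later one, ignores edge types (a real analysis would likely handle correction or verification edges differently), and uses hypothetical names (`contributing_steps`, `split_errors`) and a made-up five-step trace.

```python
from collections import defaultdict


def contributing_steps(edges, answer):
    """Return every step from which `answer` is reachable, plus `answer` itself."""
    parents = defaultdict(list)
    for src, dst in edges:
        parents[dst].append(src)

    seen = {answer}
    stack = [answer]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def split_errors(edges, answer, erroneous):
    """Separate erroneous steps that feed the answer from those that do not."""
    used = contributing_steps(edges, answer)
    propagated = sorted(step for step in erroneous if step in used)
    unused = sorted(step for step in erroneous if step not in used)
    return propagated, unused


# Hypothetical trace: step 1 branches into a dead end (2) and a path (3 -> 4)
# that leads to the final answer at step 4.
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
propagated, unused = split_errors(edges, answer=4, erroneous={2, 3})
print(propagated, unused)
```

In this toy graph, step 3 lies on a path to the answer while step 2 is a dead end, so a purely stepwise error count would report two errors even though only one can affect the result.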
Topics
- ReasoningFlow
- LLM Reasoning Traces
- Discourse Structures
- Directed Acyclic Graphs
- LLM Evaluation
- Error Propagation
- Mechanistic Interpretability
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.