ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces
Summary
ReasoningFlow is a framework for capturing the non-linear discourse structures in Large Reasoning Model (LRM) traces, such as backtracking and self-correction, by converting each trace into a fine-grained directed acyclic graph (DAG). The researchers developed and validated an annotation schema by manually annotating 31 traces (2.1k steps), achieving high inter-annotator agreement. They then scaled the approach to automatically annotate 1,260 traces (247.7k steps) across three tasks (math, science, and argumentation) and five models: Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, and GPT-oss-120B. Analysis of the resulting graphs showed that LRMs produce structurally similar traces despite diverse training, that they exhibit diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection), and that most erroneous steps do not contribute to final answers. The study also found that mechanistic causal dependencies do not reflect language-level discourse structure.
Key takeaway
If you are an NLP engineer evaluating Large Reasoning Models, understanding the model's internal reasoning process is crucial. Consider using discourse structure analysis such as ReasoningFlow to gain fine-grained insight into behaviors like self-correction and local verification. This helps you monitor reasoning traces more effectively and determine whether erroneous steps actually influence final outputs, which can guide your model refinement strategy.
Key insights
ReasoningFlow maps LRM traces to DAGs, revealing structural similarities and diverse fine-grained reasoning behaviors.
Principles
- LRMs produce structurally similar reasoning traces despite diverse training.
- Most erroneous steps do not contribute to the final answer.
- Mechanistic causal dependencies do not reflect language-level discourse structure.
Method
ReasoningFlow captures LRM discourse structures as fine-grained directed acyclic graphs (DAGs) via manual schema validation and subsequent automatic annotation for large-scale analysis.
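To make the graph representation concrete, here is a minimal sketch of a trace stored as a DAG, with steps as nodes and labeled edges between them. The `TraceGraph` class, its method names, and the edge labels ("reason", "verify", "correct") are hypothetical and chosen for illustration; they are not the ReasoningFlow schema or an API from the paper.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter, CycleError


@dataclass
class TraceGraph:
    """Hypothetical container: a reasoning trace as a DAG of steps."""
    steps: dict[int, str] = field(default_factory=dict)  # step id -> step text
    edges: list[tuple[int, int, str]] = field(default_factory=list)  # (source, target, label)

    def add_step(self, step_id: int, text: str) -> None:
        self.steps[step_id] = text

    def add_edge(self, source: int, target: int, label: str) -> None:
        if source not in self.steps or target not in self.steps:
            raise KeyError("both endpoints must be existing steps")
        self.edges.append((source, target, label))

    def topological_order(self) -> list[int]:
        """Return step ids in dependency order; raises CycleError if a cycle exists."""
        sorter = TopologicalSorter({step_id: set() for step_id in self.steps})
        for source, target, _label in self.edges:
            sorter.add(target, source)  # target depends on source
        return list(sorter.static_order())


# Toy trace with illustrative labels (assumed, not taken from the paper).
graph = TraceGraph()
for i, text in enumerate([
    "Restate the problem",
    "Compute a candidate value",
    "Check the candidate against the constraints",
    "Revise the candidate after the check fails",
]):
    graph.add_step(i, text)
graph.add_edge(0, 1, "reason")
graph.add_edge(1, 2, "verify")
graph.add_edge(1, 3, "correct")

order = graph.topological_order()
```

Storing edges with a label keeps behaviors such as verification or correction queryable later, and the topological sort doubles as a check that the annotation really is acyclic.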
In practice
- Monitor LRM reasoning traces for specific behaviors.
- Check whether erroneous steps actually contribute to the final answer.
- Analyze LRM reasoning across diverse models.
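One way to check whether an erroneous step feeds into the final answer is to test whether it has a directed path to the answer node in the trace graph. The sketch below assumes that "contributes" means "is an ancestor of the answer step", which is a simplification made here and not necessarily the paper's definition; the function name `contributing_steps` and the toy trace are hypothetical.

```python
from collections import deque


def contributing_steps(edges, answer):
    """Return every step that has a directed path to `answer` (its ancestors)."""
    parents = {}
    for source, target in edges:
        parents.setdefault(target, set()).add(source)

    seen = set()
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Toy trace: step 2 is a wrong attempt that is abandoned;
# step 3 restarts from step 1 and leads to the answer in step 4.
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
erroneous = {2}
answer = 4

ancestors = contributing_steps(edges, answer)
harmful = erroneous & ancestors  # erroneous steps the answer depends on
benign = erroneous - ancestors   # erroneous steps that were abandoned
```

In this toy trace the wrong attempt ends up in `benign`, because the answer is reached through the restarted branch only.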
Topics
- Large Reasoning Models
- Reasoning Traces
- Discourse Structures
- Directed Acyclic Graphs
- LLM Evaluation
- Model Monitoring
Best for: Research Scientist, AI Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.