ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

ReasoningFlow is a framework that captures the non-linear discourse structure of Large Reasoning Model (LRM) traces as fine-grained directed acyclic graphs (DAGs). Its schema defines 8 node types and 14 edge types and was developed through manual annotation of 31 traces (2.1k steps) with high inter-annotator agreement (Krippendorff's α > 0.8). Annotation was then automated and scaled to 1,260 traces (247.7k steps) from five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B) across math, science, and argumentation tasks. The analysis reports four findings: (1) LRMs exhibit structurally similar traces despite different training; (2) ReasoningFlow identifies diverse fine-grained behaviors such as local verification and self-reflection; (3) most erroneous steps do not contribute to the final answer; and (4) mechanistic causal dependencies do not align with language-level discourse structures.

Key takeaway

For MLOps Engineers and AI Scientists who evaluate LRM performance or build robust LLM applications, stepwise error detection alone is insufficient. Integrate discourse structure analysis, such as ReasoningFlow, to see whether an error actually propagates to the final answer or is corrected along the way. This approach reveals that most erroneous steps go unused or are neglected, which enables more accurate faithfulness monitoring and targeted improvements to reasoning processes, especially local verification mechanisms.

Key insights

ReasoningFlow maps LRM reasoning traces into fine-grained DAGs, uncovering diverse behaviors and the actual impact of errors.

Principles

LRMs exhibit structurally similar reasoning traces across models and training data.
Self-reflection sentiment in LRMs correlates with the quality of the reasoning steps being reflected on.
Most erroneous steps in LRM traces do not contribute to the final answer.

Method

Develop an annotation schema with 8 node and 14 edge types, validate manually, then use an LLM-powered pipeline for node segmentation, classification, and edge detection/classification.
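
As a rough illustration of the target data structure, the sketch below stores a trace as typed nodes and typed edges and checks that the result is acyclic. It is not the jinulee-v/reasoningflow implementation. The class name `TraceGraph`, its methods, and the labels "context", "reasoning", "conclusion", and "premise" are hypothetical placeholders, because this summary does not list the schema's 8 node types or 14 edge types.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class TraceGraph:
    """A reasoning trace as a typed directed graph (hypothetical structure)."""

    node_types: dict[int, str] = field(default_factory=dict)  # step id -> node label
    edges: list[tuple[int, int, str]] = field(default_factory=list)  # (src, dst, edge label)

    def add_node(self, step_id: int, node_type: str) -> None:
        self.node_types[step_id] = node_type

    def add_edge(self, src: int, dst: int, edge_type: str) -> None:
        if src not in self.node_types or dst not in self.node_types:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((src, dst, edge_type))

    def is_acyclic(self) -> bool:
        """Kahn's algorithm: a graph is a DAG iff all nodes can be topologically ordered."""
        indegree = {n: 0 for n in self.node_types}
        children = {n: [] for n in self.node_types}
        for src, dst, _ in self.edges:
            indegree[dst] += 1
            children[src].append(dst)
        queue = deque(n for n, d in indegree.items() if d == 0)
        visited = 0
        while queue:
            node = queue.popleft()
            visited += 1
            for child in children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
        return visited == len(self.node_types)


# Placeholder labels only; the real schema's type names are not given here.
g = TraceGraph()
g.add_node(0, "context")
g.add_node(1, "reasoning")
g.add_node(2, "conclusion")
g.add_edge(0, 1, "premise")
g.add_edge(1, 2, "premise")
print(g.is_acyclic())  # prints True
```

Running an acyclicity check after automatic edge detection is one plausible sanity test for an LLM-powered annotator, since a predicted edge that points backward in the trace would break the DAG assumption.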

In practice

Monitor fine-grained LRM reasoning behaviors like local verification and assumptions.
Improve stepwise evaluation by tracking error propagation through discourse structures.
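
One simple way to track error propagation, assuming a trace is already available as a DAG, is to ask whether each step flagged as erroneous has a directed path to the final answer. The sketch below uses plain reachability as a stand-in for "contributes to the final answer"; the paper's exact criterion is not described in this summary. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def contributing_steps(edges, final_step):
    """Return every step that has a directed path to final_step (reverse BFS)."""
    parents = defaultdict(list)
    for src, dst in edges:
        parents[dst].append(src)
    seen = set()
    queue = deque([final_step])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def split_errors(edges, final_step, erroneous):
    """Split erroneous steps into those that reach the final answer and those that do not."""
    used = contributing_steps(edges, final_step) | {final_step}
    propagated = set(erroneous) & used
    unused = set(erroneous) - used
    return propagated, unused


# Toy trace: step 2 is a dead end, step 4 feeds the final answer (step 5).
edges = [(0, 1), (1, 2), (1, 3), (3, 5), (0, 4), (4, 5)]
propagated, unused = split_errors(edges, final_step=5, erroneous={2, 4})
print(sorted(propagated), sorted(unused))  # prints [4] [2]
```

A stepwise error detector would flag both steps 2 and 4 equally. The graph view separates the error that feeds the answer from the one the model abandoned. A fuller version would also have to account for edges that mark a step as corrected or refuted, which this sketch ignores.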

Topics

ReasoningFlow
LLM Reasoning Traces
Discourse Structures
Directed Acyclic Graphs
LLM Evaluation
Error Propagation
Mechanistic Interpretability

Code references

jinulee-v/reasoningflow

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.