GraphPO: Graph-based Policy Optimization for Reasoning Models

2026-06-17 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

GraphPO (Graph-based Policy Optimization) is a reinforcement learning framework for large reasoning models that addresses limitations of standard Reinforcement Learning with Verifiable Rewards (RLVR) and of tree-based methods. RLVR samples responses independently and rewards only the final answer, which often leads to redundant exploration and a sparse training signal. Tree-based methods provide finer-grained signals, but they still expand branches independently, so exploration is repeated when different branches reach similar states. GraphPO represents rollouts as a directed acyclic graph (DAG) and merges semantically equivalent reasoning paths so that they share suffixes, which reallocates computational budget from redundant expansions to diverse exploration. It assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, with the aim of improving inference efficiency and process supervision. The theoretical analysis indicates that GraphPO reduces advantage-estimation variance and improves reasoning efficiency. In experiments on three LLMs across reasoning and agentic search benchmarks, it consistently outperforms chain- and tree-based baselines under identical token or response budgets.

Key takeaway

For machine learning engineers optimizing large reasoning models, GraphPO is worth evaluating as an alternative to standard RLVR and tree-based approaches. Its DAG-based rollout representation reduces redundant exploration and improves advantage estimation, which can improve reasoning efficiency and process supervision and may make training on complex tasks more robust and compute-efficient.

Key insights

GraphPO uses directed acyclic graphs to merge reasoning paths, reducing redundancy and improving RL for reasoning models.

Principles

In RLVR, independent response sampling causes redundant exploration, and final-answer-only rewards make the training signal sparse.
Tree-based methods improve signals but still expand branches independently, repeating exploration.
Merging semantically equivalent reasoning paths in a DAG reduces redundancy and enhances diverse exploration.
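
One way to see why sharing suffixes can help advantage estimation is a textbook Monte Carlo argument. This is an illustration under simplifying assumptions, not the paper's analysis: suppose m branches reach the same state s, each samples k independent suffixes from s, and every suffix return R has variance σ². The symbols m, k, and σ² are introduced here only for illustration.

```latex
% Per-branch estimate: each branch uses only its own k suffixes.
\[
\hat{V}_{\text{branch}}(s) = \frac{1}{k}\sum_{i=1}^{k} R_{i},
\qquad
\operatorname{Var}\!\left[\hat{V}_{\text{branch}}(s)\right] = \frac{\sigma^{2}}{k}
\]
% Merged estimate: the m branches pool all mk suffixes at the shared node.
\[
\hat{V}_{\text{merged}}(s) = \frac{1}{mk}\sum_{j=1}^{m}\sum_{i=1}^{k} R_{j,i},
\qquad
\operatorname{Var}\!\left[\hat{V}_{\text{merged}}(s)\right] = \frac{\sigma^{2}}{mk}
\]
```

Under these assumptions the pooled estimate at a merged node has m times lower variance for the same number of sampled suffixes. Alternatively, the same variance can be reached with fewer suffixes, freeing budget for other states.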

Method

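GraphPO represents rollouts as a directed acyclic graph (DAG) in which nodes are semantic states and edges are reasoning steps. It merges semantically equivalent paths so that they share suffixes, reallocates the budget saved on redundant expansions to diverse exploration, and assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges.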
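
The structure described above can be sketched in a few lines of Python. This is a toy illustration, not GraphPO's implementation: the summary does not give the merge criterion or the advantage formulas, so everything here is an assumption. The names canonical, RolloutDAG, and build_demo are hypothetical, equivalence is reduced to a case- and whitespace-insensitive string match, node values are plain means over outgoing edges, and efficiency is measured by the token cost of the single incoming step.

```python
from collections import defaultdict


def canonical(state: str) -> str:
    # Hypothetical equivalence test: ignore case and extra whitespace.
    # A real system would need a much stronger notion of semantic equivalence.
    return " ".join(state.lower().split())


class RolloutDAG:
    """Toy rollout graph: nodes are semantic states, edges are reasoning steps."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # node -> [(child, token_cost)]
        self.in_edges = defaultdict(list)   # node -> [(parent, token_cost)]
        self.rewards = {}                   # terminal node -> verifiable reward

    def add_step(self, src, dst, tokens):
        # Equivalent states collapse to one node, so later steps are shared.
        u, v = canonical(src), canonical(dst)
        if (v, tokens) not in self.out_edges[u]:
            self.out_edges[u].append((v, tokens))
            self.in_edges[v].append((u, tokens))

    def set_reward(self, state, reward):
        self.rewards[canonical(state)] = float(reward)

    def value(self, state):
        # Assumed value estimate: terminal reward, else mean over outgoing edges.
        node = canonical(state)
        if node in self.rewards:
            return self.rewards[node]
        children = self.out_edges.get(node, [])
        if not children:
            return 0.0
        return sum(self.value(child) for child, _ in children) / len(children)

    def correctness_advantages(self, state):
        # Outgoing edges: compare each continuation's value with the node's value.
        node = canonical(state)
        baseline = self.value(node)
        return {child: self.value(child) - baseline
                for child, _ in self.out_edges.get(node, [])}

    def efficiency_advantages(self, state):
        # Incoming edges: cheaper-than-average steps into the same state score higher.
        node = canonical(state)
        parents = self.in_edges.get(node, [])
        if not parents:
            return []
        mean_cost = sum(tokens for _, tokens in parents) / len(parents)
        return [(parent, mean_cost - tokens) for parent, tokens in parents]


def build_demo():
    g = RolloutDAG()
    g.add_step("x^2 = 4", "x = 2 or x = -2", tokens=10)
    g.add_step("x^2 = 4", "(x - 2)(x + 2) = 0", tokens=12)
    # Different wording, same state: this step merges into the existing node.
    g.add_step("(x - 2)(x + 2) = 0", "X = 2 or  x = -2", tokens=20)
    g.add_step("x = 2 or x = -2", "answer: 2, -2", tokens=5)
    g.add_step("x = 2 or x = -2", "answer: 2", tokens=4)
    g.set_reward("answer: 2, -2", 1)
    g.set_reward("answer: 2", 0)
    return g
```

In the demo, the direct step and the factoring route land on the same node, so the two answer continuations are sampled once and shared by both routes. Calling correctness_advantages on that node scores the two continuations against each other, while efficiency_advantages scores the two ways of arriving there by token cost.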

In practice

Enhancing large reasoning models' capabilities.
Improving performance on agentic search benchmarks.
Reducing computational waste in RL-based reasoning.

Topics

Reinforcement Learning
Policy Optimization
Reasoning Models
Directed Acyclic Graphs
Large Language Models
Agentic Search

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.