GraphPO: Graph-based Policy Optimization for Reasoning Models
Summary
GraphPO (Graph-based Policy Optimization) is a reinforcement learning framework for large reasoning models that targets limitations of standard Reinforcement Learning with Verifiable Rewards (RLVR) and of tree-based methods. RLVR samples responses independently and rewards only the final answer, so exploration is often redundant and the reward signal is sparse. Tree-based methods provide finer-grained signals, but they still expand branches independently, so exploration is repeated whenever different branches reach similar states. GraphPO represents rollouts as a directed acyclic graph (DAG) and merges semantically equivalent reasoning paths so that they share suffixes, which reallocates computational budget from redundant expansions to diverse exploration. It assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, with the aim of improving inference efficiency and process supervision. The theoretical analysis indicates that GraphPO reduces advantage-estimation variance and improves reasoning efficiency. In experiments on three LLMs across reasoning and agentic search benchmarks, it consistently outperforms chain- and tree-based baselines under identical token or response budgets.
Key takeaway
For machine learning engineers training large reasoning models with RL, GraphPO is worth evaluating as an alternative to chain-based RLVR and tree-based rollouts. Its DAG rollout representation merges semantically equivalent states, which cuts redundant exploration and, according to the theoretical analysis, lowers advantage-estimation variance. The expected payoff is finer-grained process supervision and more efficient reasoning at the same token or response budget.
Key insights
GraphPO uses directed acyclic graphs to merge reasoning paths, reducing redundancy and improving RL for reasoning models.
Principles
- In RLVR, independent response sampling causes redundant exploration, and final-answer-only rewards make the learning signal sparse.
- Tree-based methods provide finer-grained signals but still expand branches independently, so exploration is repeated when different branches reach similar states.
- Merging semantically equivalent reasoning paths in a DAG removes that redundancy and frees budget for more diverse exploration.
Method
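GraphPO represents rollouts as a directed acyclic graph (DAG) with reasoning steps as edges and semantic states as nodes. It merges semantically equivalent paths so that they share suffixes, reallocates the saved budget from redundant expansions to diverse exploration, and assigns efficiency advantages to a node's incoming edges and correctness advantages to its outgoing edges.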
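The merging step can be illustrated with a minimal sketch. It is not the authors' implementation: the class `RolloutDAG`, the method `add_path`, and especially `canonical_key` are hypothetical names introduced here, and `canonical_key` is only a toy stand-in (case and whitespace normalization) for whatever semantic-equivalence test GraphPO actually uses. The sketch shows only how two rollouts that reach an equivalent state end up sharing one node, so the suffix after that node needs to be expanded once.

```python
from collections import defaultdict


def canonical_key(state: str) -> str:
    """Hypothetical stand-in for semantic equivalence: ignore case and whitespace."""
    return "".join(state.lower().split())


class RolloutDAG:
    """Toy rollout graph: nodes are canonical states, edges are reasoning steps."""

    ROOT = "<root>"

    def __init__(self, key_fn=canonical_key):
        self.key_fn = key_fn
        self.nodes = {self.ROOT}
        self.out_edges = defaultdict(set)  # node -> {(step_text, child_node)}

    def _reaches(self, start, target):
        """Depth-first search: is `target` reachable from `start`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(child for _, child in self.out_edges.get(node, ()))
        return False

    def add_path(self, steps):
        """Insert one rollout given as [(step_text, resulting_state), ...].

        Returns how many steps landed on a node that already existed,
        i.e. how many states were merged instead of duplicated.
        """
        parent, merged = self.ROOT, 0
        for step_text, state in steps:
            child = self.key_fn(state)
            # Keep the graph acyclic: refuse an edge back to an ancestor.
            if self._reaches(child, parent):
                raise ValueError("edge would create a cycle")
            if child in self.nodes:
                merged += 1
            self.nodes.add(child)
            self.out_edges[parent].add((step_text, child))
            parent = child
        return merged


dag = RolloutDAG()
first = dag.add_path([
    ("subtract 3 from both sides", "2x = 8"),
    ("divide by 2", "x = 4"),
])
# A differently worded step that reaches an equivalent state.
second = dag.add_path([("move 3 to the right-hand side", "2X=8")])
print(first, second, len(dag.nodes))
```

In this toy run the second rollout merges into the existing `2x=8` node, which then has two incoming edges and one shared outgoing suffix. In a tree, that suffix would have been expanded separately for each branch.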
In practice
- Enhancing large reasoning models' capabilities.
- Improving performance on agentic search benchmarks.
- Reducing computational waste in RL-based reasoning.
Topics
- Reinforcement Learning
- Policy Optimization
- Reasoning Models
- Directed Acyclic Graphs
- Large Language Models
- Agentic Search
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.