GraphPO: Graph-based Policy Optimization for Reasoning Models
Summary
GraphPO (Graph-based Policy Optimization) is a reinforcement learning framework for large reasoning models that addresses limitations of both standard Reinforcement Learning with Verifiable Rewards (RLVR) and tree-based methods. RLVR samples responses independently, so it suffers from sparse final-answer rewards and redundant exploration. Tree-based methods share prefixes but still treat divergent branches independently, so they repeatedly explore semantically similar states. GraphPO instead represents rollouts as a directed acyclic graph (DAG) and merges semantically equivalent reasoning paths into equivalence classes. Merged paths share suffixes, the budget saved on redundant expansions is reallocated to more diverse exploration, and dual advantages (efficiency and correctness) are assigned to improve inference efficiency and derive process supervision. In experiments on three LLMs, including Qwen2.5-7B-Math, across reasoning and agentic search benchmarks, GraphPO consistently outperforms chain- and tree-based baselines under equal token or response budgets. The method uses Qwen2.5-7B-Instruct for state summarization and SFR-Embedding-2-R for similarity detection, with an optimal merging threshold near 0.92 and a pooling coefficient around 0.7.
Key takeaway
For machine learning engineers optimizing large reasoning models, GraphPO offers a way to get more out of a fixed training budget. If your RLVR training suffers from redundant exploration or sparse rewards, consider graph-based rollouts: merging equivalent states reduces wasted computation and provides denser, more accurate step-level supervision, leading to better performance under the same token budget. Agentic search and mathematical reasoning tasks are natural places to try it.
Key insights
GraphPO uses a graph-based RL framework to merge semantically similar reasoning states, reducing redundancy and improving exploration efficiency.
Principles
- Merge semantically equivalent states.
- Share suffixes for denser rewards.
- Reallocate budget from redundant paths.
Method
GraphPO incrementally builds a DAG, embedding semantic states for equivalence detection via cosine similarity. It pools scores from equivalent nodes for step rewards and uses dual-group advantages for policy optimization.
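To make the score-pooling step concrete, here is a minimal Python sketch. This summary does not give the exact pooling formula, so the blend below is an assumption: each node's step reward is a weighted mix of its equivalence class's mean score and its own score, with a hypothetical weight `alpha` standing in for the pooling coefficient (around 0.7). The function name `pooled_step_rewards` and the toy scores are illustrative only.

```python
from statistics import mean


def pooled_step_rewards(node_scores, classes, alpha=0.7):
    """Blend each node's own score with the mean score of its equivalence class.

    node_scores: dict mapping node id -> score (e.g. success rate of rollouts
                 passing through that node).
    classes:     list of equivalence classes, each a list of node ids.
    alpha:       ASSUMED pooling weight on the class mean; how the paper
                 actually applies its pooling coefficient may differ.
    """
    rewards = {}
    for members in classes:
        class_mean = mean(node_scores[n] for n in members)
        for n in members:
            rewards[n] = alpha * class_mean + (1 - alpha) * node_scores[n]
    return rewards


# Toy example: nodes "a" and "b" were merged as equivalent, "c" stands alone.
scores = {"a": 1.0, "b": 0.0, "c": 0.5}
classes = [["a", "b"], ["c"]]
rewards = pooled_step_rewards(scores, classes)
print(rewards)
```

In this toy case the two merged nodes are pulled toward their shared mean of 0.5, which illustrates how pooling lets one node borrow evidence from its equivalents instead of relying on its own sparse outcome alone.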
In practice
- Summarize states with Qwen2.5-7B-Instruct.
- Detect similarity using SFR-Embedding-2-R.
- Tune merging threshold κ around 0.92.
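The effect of the merging threshold can be illustrated with a small Python sketch. It assumes a simple greedy rule, which may not match the paper's procedure: each new state embedding joins the first existing class whose representative has cosine similarity of at least κ, and otherwise starts a new class. The toy 2-D vectors stand in for real SFR-Embedding-2-R embeddings of summarized states, and `merge_states` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_states(embeddings, kappa=0.92):
    """Greedily group state embeddings into equivalence classes.

    ASSUMPTION: a state joins the first class whose representative (the
    class's first member) has cosine similarity >= kappa; otherwise it
    becomes the representative of a new class.
    Returns a list of classes, each a list of state indices.
    """
    representatives = []
    classes = []
    for idx, emb in enumerate(embeddings):
        for cid, rep in enumerate(representatives):
            if cosine(emb, rep) >= kappa:
                classes[cid].append(idx)
                break
        else:
            # No existing class was similar enough: open a new one.
            representatives.append(emb)
            classes.append([idx])
    return classes


# Toy embeddings: states 0 and 1 are near-duplicates, 2 and 3 are distinct.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.6, 0.8]]
groups = merge_states(toy)                     # strict threshold
loose_groups = merge_states(toy, kappa=0.75)   # looser threshold merges more
print(groups, loose_groups)
```

Lowering κ in this sketch merges states 2 and 3 as well, which shows the trade-off being tuned: a looser threshold saves more budget but risks merging states that are not truly equivalent.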
Topics
- Graph-based RL
- Large Reasoning Models
- Policy Optimization
- RL with Verifiable Rewards
- Exploration Efficiency
- Semantic State Merging
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.