GraphPO: Graph-based Policy Optimization for Reasoning Models

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

GraphPO (Graph-based Policy Optimization) is a reinforcement learning framework for large reasoning models that targets two weaknesses of existing approaches. Reinforcement Learning with Verifiable Rewards (RLVR) samples responses independently, so it suffers from sparse final-answer rewards and redundant exploration. Tree-based methods share prefixes but still treat divergent branches independently, so semantically similar states are explored repeatedly. GraphPO instead represents rollouts as a directed acyclic graph and merges semantically equivalent reasoning states into equivalence classes. Merged states share suffixes, the budget saved on redundant expansions is reallocated to more diverse exploration, and dual advantages (efficiency and correctness) are assigned to improve inference efficiency and derive process supervision.

Experiments on three LLMs, including Qwen2.5-7B-Math, across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under equal token or response budgets. The method uses Qwen2.5-7B-Instruct for state summarization and SFR-Embedding-2-R for similarity detection; the reported best settings are a merging threshold near 0.92 and a pooling coefficient around 0.7.

Key takeaway

For machine learning engineers optimizing large reasoning models, GraphPO targets wasted rollout compute. If your RLVR training suffers from redundant exploration or sparse rewards, graph-based rollouts are worth evaluating: merging equivalent states cuts redundant expansion and yields denser, more accurate step-level supervision, which the paper reports as better performance under the same token budget. Agentic search and mathematical reasoning tasks are natural places to try it.

Key insights

GraphPO uses a graph-based RL framework to merge semantically similar reasoning states, reducing redundancy and improving exploration efficiency.

Principles

Merge semantically equivalent states.
Share suffixes for denser rewards.
Reallocate budget from redundant paths.
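
To make the second principle concrete, here is a minimal sketch of how suffix sharing can densify rewards once equivalent states are merged. It assumes, purely for illustration, that a node's step value is the mean final-answer reward over the distinct leaves reachable from it; this is not the paper's estimator, and `node_values`, `children`, and `leaf_reward` are hypothetical names.

```python
from functools import lru_cache


def node_values(children, leaf_reward):
    """Mean reward of the distinct leaves reachable from each node of a DAG.

    children: dict mapping node -> tuple of child nodes (leaves may be absent).
    leaf_reward: dict mapping leaf node -> verifiable final-answer reward.
    Assumption for illustration only: value = mean over reachable leaves.
    """

    @lru_cache(maxsize=None)
    def reachable_leaves(node):
        kids = children.get(node, ())
        if not kids:
            return frozenset([node])
        leaves = frozenset()
        for kid in kids:
            leaves |= reachable_leaves(kid)
        return leaves

    nodes = set(children) | {c for kids in children.values() for c in kids}
    values = {}
    for node in nodes:
        leaves = reachable_leaves(node)
        values[node] = sum(leaf_reward[leaf] for leaf in leaves) / len(leaves)
    return values


# Toy rollout graph: branches "a" and "b" both reach the merged state "m",
# so both inherit the outcomes of m's shared suffixes "x" and "y".
children = {
    "root": ("a", "b"),
    "a": ("m",),
    "b": ("m", "z"),
    "m": ("x", "y"),
}
leaf_reward = {"x": 1.0, "y": 0.0, "z": 0.0}
values = node_values(children, leaf_reward)
```

In this toy graph, branch "a" is scored from two outcomes even though it was expanded only once, because the merged state "m" shares its suffixes with every path that reaches it.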

Method

GraphPO builds the DAG incrementally: each new reasoning state is embedded and compared with existing nodes by cosine similarity, and states judged equivalent are merged into one equivalence class. Step rewards pool the scores of equivalent nodes, and policy optimization uses dual-group advantages.
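
The merge-and-pool step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: states join the most similar existing class when cosine similarity reaches the threshold, each class is represented by its first member, and the pooling coefficient is read as the weight on the class mean versus a node's own score. The function names and the exact pooling formula are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_classes(embeddings, kappa=0.92):
    """Incrementally assign each state embedding to an equivalence class.

    A state joins the most similar existing class if similarity >= kappa,
    otherwise it starts a new class (assumption: first member represents it).
    """
    representatives = []
    labels = []
    for emb in embeddings:
        best_id, best_sim = None, kappa
        for class_id, rep in enumerate(representatives):
            sim = cosine(emb, rep)
            if sim >= best_sim:
                best_id, best_sim = class_id, sim
        if best_id is None:
            representatives.append(emb)
            best_id = len(representatives) - 1
        labels.append(best_id)
    return labels


def pooled_scores(scores, labels, alpha=0.7):
    """Blend each node's score with its class mean (assumed pooling rule)."""
    totals, counts = {}, {}
    for score, label in zip(scores, labels):
        totals[label] = totals.get(label, 0.0) + score
        counts[label] = counts.get(label, 0) + 1
    return [
        alpha * totals[label] / counts[label] + (1 - alpha) * score
        for score, label in zip(scores, labels)
    ]


# Toy example: the first two states are near-duplicates, the third differs.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
labels = assign_classes(embeddings, kappa=0.92)  # [0, 0, 1]
pooled = pooled_scores([1.0, 0.0, 0.5], labels, alpha=0.7)
```

With pooling, the two near-duplicate states end up with closer step rewards than their raw scores of 1.0 and 0.0, which is the intended smoothing effect across an equivalence class.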

In practice

Summarize states with Qwen2.5-7B-Instruct.
Detect similarity using SFR-Embedding-2-R.
Tune merging threshold κ around 0.92.
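
The settings above can be gathered into one validated configuration object. The sketch below is an assumption-laden illustration: the class, its field names, and the `threshold_grid` helper are hypothetical, the model names are copied from this summary rather than verified model-hub identifiers, and the [0, 1] range for the pooling coefficient is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphRolloutConfig:
    """Hypothetical container for the settings reported in the summary."""

    summarizer_model: str = "Qwen2.5-7B-Instruct"  # name as given in the summary
    embedding_model: str = "SFR-Embedding-2-R"     # name as given in the summary
    merge_threshold: float = 0.92                  # kappa, cosine similarity cutoff
    pooling_coefficient: float = 0.7               # assumed to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 < self.merge_threshold <= 1.0:
            raise ValueError("merge_threshold must be in (0, 1]")
        if not 0.0 <= self.pooling_coefficient <= 1.0:
            raise ValueError("pooling_coefficient must be in [0, 1]")


def threshold_grid(center=0.92, step=0.02, n=2):
    """Candidate kappa values around `center` for a small tuning sweep."""
    candidates = (center + i * step for i in range(-n, n + 1))
    return [round(k, 4) for k in candidates if 0.0 < k <= 1.0]


config = GraphRolloutConfig()
grid = threshold_grid()
```

A sweep over `grid` gives a cheap check of how sensitive merging is to κ on your own data before committing to the reported value.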

Topics

Graph-based RL
Large Reasoning Models
Policy Optimization
RL with Verifiable Rewards
Exploration Efficiency
Semantic State Merging

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.