PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation
Summary
PathRouter introduces a path-aware training framework for agentic GraphRAG, designed to align rewards with retrieval quality. It addresses critical issues in outcome-only reinforcement learning, specifically "answer-path reward aliasing" where correct answers arise from shortcuts, and "search-update ambiguity" where scalar feedback lacks action-specific guidance. PathRouter jointly evaluates trajectories based on answer correctness and evidence-path overlap, categorizing them for differentiated GRPO advantage scaling to suppress shortcut reinforcement. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens. Experiments across six QA benchmarks and three model sizes demonstrate PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models.
Key takeaway
For ML engineers building agentic GraphRAG systems, PathRouter targets two failure modes of outcome-only reinforcement learning. It scores each trajectory jointly on answer correctness and evidence-path overlap, then uses differentiated GRPO advantage scaling and token-level KL guidance to mitigate "answer-path reward aliasing" and "search-update ambiguity". Consider adding path-aware rewards and targeted teacher guidance to improve both answer accuracy and evidence-seeking behavior; the reported average F1 gains are 3.1 on 3B and 4.9 on 7B models.
Key insights
PathRouter aligns rewards with retrieval quality in agentic GraphRAG by mitigating "answer-path reward aliasing" and "search-update ambiguity".
Principles
- Reward aliasing and search ambiguity hinder agentic retrieval RL.
- Differentiated reward scaling suppresses shortcut behaviors.
- Token-level teacher guidance improves evidence-seeking.
Method
PathRouter jointly evaluates trajectories for answer correctness and evidence-path overlap, applying differentiated GRPO advantage scaling and a gold-evidence teacher for token-level KL guidance on reasoning and search-query tokens.
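The categorize-then-scale step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the category names, the overlap threshold of 0.5, the scale factors in `SCALE`, and the recall-style `path_overlap` measure are all hypothetical, since this summary does not give the actual values or formulas. The one standard piece is the group-relative advantage normalization used in GRPO.

```python
from statistics import mean, pstdev

# Hypothetical per-category multipliers (not from the paper). A correct answer
# reached without the gold evidence path ("shortcut") gets a damped advantage.
SCALE = {
    "correct_grounded": 1.0,
    "correct_shortcut": 0.3,
    "wrong_grounded": 0.5,
    "wrong_ungrounded": 1.0,
}


def path_overlap(retrieved, gold):
    """Fraction of gold evidence items that the trajectory retrieved (assumed metric)."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(retrieved)) / len(gold)


def categorize(correct, overlap, threshold=0.5):
    """Bucket a trajectory by answer correctness and evidence overlap (assumed threshold)."""
    if correct:
        return "correct_grounded" if overlap >= threshold else "correct_shortcut"
    return "wrong_grounded" if overlap >= threshold else "wrong_ungrounded"


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: reward minus group mean, divided by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scaled_advantages(rewards, categories, scale=SCALE):
    """Multiply each group-relative advantage by its category's scale factor."""
    return [a * scale[c] for a, c in zip(grpo_advantages(rewards), categories)]
```

With these assumed factors, a shortcut trajectory that earns the same outcome reward as a grounded one still receives a smaller positive advantage, which is the intended suppression effect.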
In practice
- Implement path-aware reward mechanisms in agentic systems.
- Apply token-level teacher guidance to reasoning and search-query tokens, not answer tokens.
- Categorize trajectories for targeted reinforcement adjustments.
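The token-level guidance mentioned in this list can be sketched as a masked KL term between a teacher and a student distribution at each token position. This is a minimal sketch under assumptions: the role labels (`"reasoning"`, `"search"`, `"answer"`), the function names, the KL direction (teacher relative to student), and averaging over guided tokens are all illustrative choices that this summary does not specify. According to the summary, the term would be applied only to evidence-poor trajectories.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def masked_token_kl(teacher_logits, student_logits, token_roles,
                    guided_roles=("reasoning", "search")):
    """Mean per-token KL(teacher || student) over guided tokens only.

    Tokens whose role is not in guided_roles (for example answer tokens)
    contribute nothing. The role labels are hypothetical.
    """
    total, count = 0.0, 0
    for t, s, role in zip(teacher_logits, student_logits, token_roles):
        if role not in guided_roles:
            continue  # skip answer tokens and any other unguided tokens
        total += kl_divergence(softmax(t), softmax(s))
        count += 1
    return total / count if count else 0.0
```

Masking by role keeps the teacher from pushing the student toward a particular answer string, so the guidance shapes only how the model reasons and what it searches for.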
Topics
- Agentic GraphRAG
- Reinforcement Learning
- Reward Alignment
- Retrieval-Augmented Generation
- Language Models
- Question Answering
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.