SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
Summary
SPADER is a reinforcement learning framework for long-horizon tool use in Multi-Answer Question Answering (Multi-Answer QA) with large language models. It targets two challenges: fine-grained credit assignment over extended search trajectories, and reward alignment for comprehensive exploration. The framework combines two components. Step-wise Peer Advantage (SPA) is a critic-free mechanism that assigns step-level credit by aligning parallel trajectories and estimating advantages from peer returns. The diversity-aware exploration reward promotes the discovery of long-tail entities by upweighting rare findings and downweighting redundant ones. In experiments on QAMPARI, Mintaka, WebQSP, and QUEST, SPADER significantly improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and other step-level supervision approaches. The code and model weights are publicly available.
Key takeaway
For machine learning engineers building tool-augmented LLM agents for complex information retrieval, SPADER offers a framework for improving Multi-Answer QA performance. Consider integrating Step-wise Peer Advantage for fine-grained credit assignment and the diversity-aware exploration reward for coverage of both common and rare entities; the reported gains are in recall and overall F1.
Key insights
SPADER enhances Multi-Answer QA by using step-wise peer advantage and diversity-aware rewards for comprehensive, long-tail entity discovery.
Principles
- Fine-grained credit assignment improves long-horizon tool use.
- Diversity-aware rewards promote discovering rare entities.
- Peer returns can estimate advantages without a critic.
Method
SPADER employs Step-wise Peer Advantage (SPA) for step-level credit assignment via parallel trajectory alignment and peer return advantage estimation, combined with a diversity-aware reward for long-tail entity discovery.
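To make the idea of critic-free, step-level credit concrete, here is a minimal sketch of a peer-based advantage. It is not SPADER's actual formulation, which this summary does not give. It assumes that parallel trajectories for the same question are aligned simply by step index, and that the advantage of a step is its return minus the mean return of the other trajectories at that step. The function name `peer_advantages` and the input layout are hypothetical.

```python
from typing import List


def peer_advantages(returns: List[List[float]]) -> List[List[float]]:
    """Leave-one-out peer advantage per step (illustrative assumption).

    returns[i][t] is the return-to-go of trajectory i at step t.
    Trajectories may have different lengths; alignment by step index
    is an assumption of this sketch, not a detail from the source.
    """
    advantages = [[0.0] * len(r) for r in returns]
    max_len = max((len(r) for r in returns), default=0)
    for t in range(max_len):
        # Trajectories that are still running at step t.
        active = [i for i, r in enumerate(returns) if len(r) > t]
        if len(active) < 2:
            continue  # no peers at this step: leave the advantage at 0.0
        total = sum(returns[i][t] for i in active)
        for i in active:
            # Baseline is the mean return of the peers, excluding trajectory i.
            peer_mean = (total - returns[i][t]) / (len(active) - 1)
            advantages[i][t] = returns[i][t] - peer_mean
    return advantages


if __name__ == "__main__":
    group = [[1.0, 0.5], [0.0, 0.5], [0.5]]
    print(peer_advantages(group))
```

No learned value function appears anywhere: the baseline at each step comes entirely from the sibling rollouts, which is what "critic-free" means in this sketch.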
In practice
- Apply SPADER to improve recall in Multi-Answer QA.
- Use diversity rewards for long-tail entity discovery.
- Implement critic-free credit assignment with peer returns.
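As a rough illustration of a diversity-aware reward, the sketch below upweights correct entities that few rollouts found and gives no extra credit for repeats. The weighting scheme is an assumption made for illustration, not the reward defined in the original article: rarity is measured by how many rollouts in the sampled group found each gold entity, and each entity's weight is the inverse of that count. The name `diversity_rewards` is hypothetical.

```python
from collections import Counter
from typing import List, Set


def diversity_rewards(found: List[List[str]], gold: Set[str]) -> List[float]:
    """Inverse-frequency reward per rollout (illustrative assumption).

    found[i] lists the entities returned by rollout i, possibly with
    repeats or wrong answers. gold is the set of correct answers.
    """
    # Deduplicate within each rollout and drop incorrect entities,
    # so redundant or wrong findings earn nothing.
    unique = [set(f) & gold for f in found]
    # Number of rollouts in the group that found each correct entity.
    counts = Counter(e for s in unique for e in s)
    # A correct entity found by k rollouts contributes 1/k to each of them.
    return [sum(1.0 / counts[e] for e in s) for s in unique]


if __name__ == "__main__":
    rollouts = [["Paris", "Lyon", "Paris"], ["Paris"], ["Paris", "Nice", "Berlin"]]
    print(diversity_rewards(rollouts, {"Paris", "Lyon", "Nice"}))
```

Under this scheme an entity that every rollout finds is worth little to any single rollout, while an entity only one rollout finds is worth a full point, which pushes the policy toward long-tail answers.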
Topics
- Multi-Answer QA
- Reinforcement Learning
- Large Language Models
- Tool-Augmented Agents
- Credit Assignment
- Diversity-Aware Exploration
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.