IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning
Summary
IG-Search is a reinforcement learning framework that improves search-augmented reasoning in large language models by introducing a step-level reward based on Information Gain (IG). Unlike existing methods that use trajectory-level rewards, IG-Search measures, for each search step, how much the retrieved documents raise the model's confidence in the correct answer relative to a random-document baseline. This fine-grained signal is fed back to the search-query tokens through per-token advantage modulation in GRPO, allowing precise credit assignment within a rollout. The framework requires no external intermediate annotations, relying instead on the policy's own generation probabilities. Across seven single-hop and multi-hop QA benchmarks, IG-Search with Qwen2.5-3B reaches an average Exact Match (EM) of 0.430, outperforming MR-Search by 1.6 points and GiGPO by 0.9 points, with the gains most pronounced on multi-hop tasks. Training wall-clock time increases by only about 6.4% per step, and inference latency is unchanged.
Key takeaway
For AI engineers building search-augmented reasoning systems, IG-Search offers a way to improve model performance, particularly on multi-hop tasks, at a cost of roughly 6.4% more training time per step and no added inference latency. Consider step-level Information Gain rewards to refine search-query generation and sharpen credit assignment in reinforcement learning pipelines, especially when external intermediate annotations are impractical.
Key insights
IG-Search uses step-level Information Gain rewards to refine search queries in LLMs, improving reasoning without extra annotations.
Principles
- Step-level rewards improve credit assignment.
- Information Gain quantifies search query effectiveness.
- Policy's own probabilities can generate supervision.
Method
IG-Search calculates Information Gain for each search step by comparing the model's confidence in the correct answer given the retrieved documents against its confidence given random documents, then applies this signal to the search-query tokens via per-token advantage modulation in GRPO.
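The Information Gain comparison described above can be illustrated with a minimal sketch. It assumes that "confidence" is the length-normalized log-probability the policy assigns to the gold answer tokens, and that IG is the difference between that score under the retrieved documents and under random documents. The summary does not give the paper's exact formula, so the log-difference form, the length normalization, and the function names `answer_confidence` and `information_gain` are all illustrative assumptions.

```python
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Mean log-probability of the gold answer tokens (assumed confidence measure)."""
    if not token_logprobs:
        raise ValueError("answer must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def information_gain(
    logprobs_with_retrieved: Sequence[float],
    logprobs_with_random: Sequence[float],
) -> float:
    """Hypothetical step-level IG: confidence with retrieved documents
    minus confidence with random baseline documents.

    Both inputs are per-token log-probabilities of the same gold answer,
    scored by the policy under two different contexts.
    """
    return answer_confidence(logprobs_with_retrieved) - answer_confidence(
        logprobs_with_random
    )


# Positive IG: the retrieved documents made the correct answer more likely.
ig = information_gain([-0.5, -0.5], [-2.0, -1.0])
```

In this sketch a positive value means the search step helped, a value near zero means the retrieved documents were no better than random ones, and a negative value means they reduced the model's confidence in the correct answer.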
In practice
- Apply step-level rewards for complex tasks.
- Use IG to evaluate search query quality.
- Consider per-token advantage modulation in GRPO for fine-grained credit assignment.
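A rough sketch of how a step-level signal might be routed to search-query tokens is shown here. The group-normalized advantage follows the usual GRPO formulation (reward minus group mean, divided by group standard deviation). The modulation rule itself is an assumption: the summary does not state how IG-Search combines the IG signal with the trajectory-level advantage, so the additive form, the `scale` parameter, and the helper names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages: one scalar per rollout in the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def modulate_token_advantages(
    base_advantage: float,
    num_tokens: int,
    query_spans: Sequence[Tuple[int, int]],
    step_gains: Sequence[float],
    scale: float = 1.0,
) -> List[float]:
    """Assumed additive modulation: every token starts with the rollout-level
    advantage, and tokens inside each search-query span [start, end) are
    shifted by that step's information gain times `scale`."""
    advantages = [base_advantage] * num_tokens
    for (start, end), gain in zip(query_spans, step_gains):
        for t in range(start, end):
            advantages[t] = base_advantage + scale * gain
    return advantages


# One rollout of 5 tokens whose search query occupies tokens 1 and 2.
token_adv = modulate_token_advantages(1.0, 5, [(1, 3)], [0.5])
```

Under these assumptions, only the query tokens receive the step-level adjustment, while reasoning and answer tokens keep the shared rollout-level advantage.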
Topics
- Reinforcement Learning
- Search-Augmented Reasoning
- Information Gain
- Large Language Models
- Question Answering
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.