IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

IG-Search is a reinforcement learning framework that improves search-augmented reasoning in large language models by adding a step-level reward based on Information Gain (IG). Where existing methods rely on trajectory-level rewards, IG-Search measures, for each search step, how much the retrieved documents raise the model's confidence in the correct answer relative to a random-document baseline. This fine-grained signal is fed back to the search-query tokens through per-token advantage modulation in GRPO, which allows precise credit assignment within a rollout. The framework needs no external intermediate annotations; it relies on the policy's own generation probabilities. Across seven single-hop and multi-hop QA benchmarks, IG-Search with Qwen2.5-3B reaches an average Exact Match (EM) of 0.430, outperforming MR-Search by 1.6 points and GiGPO by 0.9 points, particularly on multi-hop tasks. Training wall-clock time rises by only about 6.4% per step, and inference latency is unchanged.
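The step-level reward can be written down compactly. The following is only a sketch of one plausible formalization: the notation ($q$ for the question, $a^{*}$ for the gold answer, $c_{<t}$ for the rollout context before step $t$, $D_t$ for the retrieved documents, $\tilde{D}_t$ for the random documents) and the choice of log-probabilities are assumptions made for illustration, not the paper's stated definition.

```latex
\mathrm{IG}_t \;=\;
  \log \pi_\theta\!\left(a^{*} \mid q,\, c_{<t},\, D_t\right)
  \;-\;
  \log \pi_\theta\!\left(a^{*} \mid q,\, c_{<t},\, \tilde{D}_t\right)
```

Under this reading, a positive $\mathrm{IG}_t$ means the query at step $t$ retrieved documents that raised the policy's confidence in the correct answer above what random documents would give, and a value near zero or below means the query added nothing useful.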

Key takeaway

For AI engineers building search-augmented reasoning systems, IG-Search offers a way to improve model performance, particularly on multi-hop tasks, at a training cost of about 6.4% more wall-clock time per step and with no added inference latency. Consider step-level Information Gain rewards to refine search-query generation and sharpen credit assignment in your reinforcement learning pipelines, especially when external intermediate annotations are impractical.

Key insights

IG-Search uses step-level Information Gain rewards to refine search queries in LLMs, improving reasoning without extra annotations.

Method

IG-Search calculates Information Gain for each search step by comparing the model's confidence in the correct answer given the retrieved documents against its confidence given a random-document baseline. It then applies this signal to the search-query tokens via per-token advantage modulation in GRPO.
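The two steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: confidence is taken as the length-normalised log-probability of the gold-answer tokens, the modulation is additive, and the function names and the `alpha` weight are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def answer_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised log-probability of the gold-answer tokens (assumed confidence measure)."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)


def information_gain(lp_with_retrieved: Sequence[float],
                     lp_with_random: Sequence[float]) -> float:
    """IG for one search step: confidence with retrieved docs minus the random-document baseline."""
    return answer_logprob(lp_with_retrieved) - answer_logprob(lp_with_random)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardise trajectory-level rewards within one rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def modulate_advantages(base_advantage: float,
                        num_tokens: int,
                        query_spans: Sequence[Tuple[int, int]],
                        step_igs: Sequence[float],
                        alpha: float = 0.5) -> List[float]:
    """Per-token advantages for one rollout.

    Every token starts from the trajectory-level advantage; tokens inside each
    search-query span [start, end) are shifted by alpha * IG of that step.
    The additive form and alpha are assumptions made for this sketch.
    """
    advantages = [base_advantage] * num_tokens
    for (start, end), ig in zip(query_spans, step_igs):
        for i in range(start, end):
            advantages[i] += alpha * ig
    return advantages


# Example: one rollout of 6 tokens whose search query occupies tokens 1..2.
ig = information_gain([-0.2, -0.4], [-1.0, -1.2])
base = group_relative_advantages([1.0, 0.0])[0]
per_token = modulate_advantages(base, 6, [(1, 3)], [ig])
```

In this sketch only the query tokens receive the step-level signal, so a rollout that reaches the right answer can still have an uninformative query penalised relative to its other tokens. The two log-probability inputs come from scoring the gold answer under the policy itself, which is why no external annotations are needed.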

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.