Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

2026-04-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Contribution-Weighted GRPO (CW-GRPO) is a new framework that improves Large Language Model (LLM)-based search agents by integrating process supervision into group relative policy optimization. Existing reinforcement learning methods for these agents have two weaknesses: process supervision suffers from unstable value estimation, while outcome supervision struggles with credit assignment because its rewards are sparse and trajectory-level. CW-GRPO addresses both by using an LLM judge to evaluate retrieval utility and reasoning correctness at each search round, producing per-round contribution scores. These scores rescale the outcome-based advantages along the trajectory, which gives fine-grained credit assignment while preserving optimization stability. On knowledge-intensive benchmarks, CW-GRPO outperforms standard GRPO by 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B and produces more effective search behaviors.

Key takeaway

For research scientists developing LLM-based search agents, CW-GRPO offers a way to address the credit assignment problem in reinforcement learning. Consider using an LLM judge to generate per-round contribution scores: the reported gains over standard GRPO are 5.0% on Qwen3-8B and 6.3% on Qwen3-1.7B, alongside more effective search behaviors.

Key insights

CW-GRPO improves LLM search agents by using an LLM judge for fine-grained, per-round credit assignment.

Principles

Combine process and outcome supervision
Rescale advantages with contribution scores
Concentrated contributions indicate success

Method

CW-GRPO uses an LLM judge to assign per-round contribution scores based on retrieval utility and reasoning correctness. These scores then rescale outcome-based advantages for fine-grained credit assignment during policy optimization.
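
The rescaling step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the summary does not give the exact formula, so the normalization used here (weights that average to 1 within a trajectory), the uniform fallback, and all function names are assumptions made for illustration. The judge scores are taken as given, one non-negative number per search round.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantage: z-score of each outcome reward
    within the group of trajectories sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contribution_weights(round_scores, eps=1e-6):
    """Turn per-round judge scores into weights that average to 1.

    ASSUMPTION: this normalization is illustrative, not taken from the
    paper. If the judge gives (almost) no credit to any round, fall back
    to uniform weights so the trajectory keeps its outcome advantage.
    """
    n = len(round_scores)
    total = sum(round_scores)
    if total <= eps:
        return [1.0] * n
    return [n * s / total for s in round_scores]


def contribution_weighted_advantages(rewards, judge_scores):
    """Rescale each trajectory's outcome advantage round by round.

    rewards:      one outcome reward per trajectory in the group
    judge_scores: one list of per-round scores per trajectory
    Returns one list of per-round advantages per trajectory.
    """
    advantages = group_relative_advantages(rewards)
    return [
        [adv * w for w in contribution_weights(scores)]
        for adv, scores in zip(advantages, judge_scores)
    ]


# Hypothetical group of two trajectories for one question.
rewards = [1.0, 0.0]
judge_scores = [[0.0, 1.0, 1.0], [0.5, 0.5]]
per_round = contribution_weighted_advantages(rewards, judge_scores)
```

Because the weights average to 1 in this sketch, the mean per-round advantage of a trajectory equals its original outcome-based advantage, so only the distribution of credit across rounds changes. How failed trajectories (negative advantages) should be weighted is a design choice the summary does not specify; this sketch applies the same weights regardless of sign.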

In practice

Implement LLM judges for search agents
Apply CW-GRPO to Qwen3-8B or Qwen3-1.7B
Analyze contribution concentration in trajectories
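
One possible way to approach the last item, analyzing contribution concentration, is sketched below in Python. The summary does not say how concentration is measured, so the metric here (1 minus normalized entropy of the per-round contribution scores) is an assumption, as are the function names and the made-up example data.

```python
import math


def contribution_concentration(round_scores):
    """Return a value in [0, 1]: 0 means contributions are spread evenly
    across rounds, 1 means a single round holds all the contribution.

    ASSUMPTION: 1 - normalized entropy is one reasonable metric, not
    necessarily the one used in the paper. Returns None when the metric
    is undefined (fewer than two rounds, or no positive contribution).
    """
    n = len(round_scores)
    total = sum(round_scores)
    if n < 2 or total <= 0:
        return None
    probs = [s / total for s in round_scores]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(n)


def mean_concentration_by_outcome(trajectories):
    """Average concentration for correct vs. incorrect trajectories.

    trajectories: iterable of (is_correct, round_scores) pairs.
    Returns {True: mean or None, False: mean or None}.
    """
    buckets = {True: [], False: []}
    for is_correct, scores in trajectories:
        c = contribution_concentration(scores)
        if c is not None:
            buckets[bool(is_correct)].append(c)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Made-up judge scores, for illustration only.
example = [
    (True, [0.0, 0.1, 0.9]),
    (True, [0.0, 1.0]),
    (False, [0.3, 0.3, 0.4]),
]
summary = mean_concentration_by_outcome(example)
```

Comparing the two averages over real rollouts would show whether the pattern named under Principles holds for a given agent and dataset.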

Topics

LLM Search Agents
Reinforcement Learning
Contribution-Weighted GRPO
Policy Optimization
LLM Judge

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.