CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

Credit-Attenuated Privileged Feedback (CAPF) is a training mechanism for LLM search agents trained with reinforcement learning with verifiable rewards (RLVR). On hard problems these agents rarely complete a successful end-to-end rollout, so training yields few positive-reward trajectories to learn from. CAPF addresses this with a training-time "Privileged Feedback call" that uses verifier-side information to point out errors or omissions in the agent's submitted answer. With this feedback, the policy can revise an initial zero-reward attempt into a positive-reward "repair trajectory." Because the call is unavailable at deployment, CAPF attenuates the credit assigned to the feedback interaction and the actions preceding it, so that the trained policy remains deployable without it. In the reported experiments, CAPF raised Qwen3-4B's average exact-match score across seven open-domain QA benchmarks from 44.7% under outcome-only RLVR to 48.5%.

Key takeaway

Machine learning engineers whose LLM search agents stall on complex, low-reward problems should consider adding Credit-Attenuated Privileged Feedback (CAPF) to the training pipeline. The agent learns from verifier-identified errors during training, turning failed attempts into successful repair trajectories. In the reported results, Qwen3-4B gained 3.8 percentage points in average exact-match score (44.7% to 48.5%), and the resulting model remains deployable without the training-time feedback mechanism.

Key insights

CAPF uses verifier feedback during training to improve LLM search agent performance on hard problems.

Principles

Method

During training, CAPF exposes verifier-side error information through a Privileged Feedback call. The policy uses it to revise a zero-reward attempt into a repair trajectory, and credit for the feedback interaction and the actions preceding it is attenuated.
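One way to picture the credit attenuation is as a per-step weight on the advantage of a repair trajectory. The Python sketch below is an illustration only, not the published implementation. The step labels, the credit_weights and weighted_advantages helpers, and the single scalar factor alpha are all assumptions made for this example. The summary does not say how CAPF actually computes the attenuation.

```python
from typing import List

# Hypothetical label marking the training-time Privileged Feedback call.
FEEDBACK = "feedback"


def credit_weights(step_kinds: List[str], alpha: float = 0.5) -> List[float]:
    """Return one credit multiplier per step of a trajectory.

    Assumption for illustration: the feedback call and every step before it
    are scaled by `alpha` (0 <= alpha <= 1); steps after the feedback call
    keep full credit. Trajectories without a feedback call are left untouched.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if FEEDBACK not in step_kinds:
        return [1.0] * len(step_kinds)
    fb_index = step_kinds.index(FEEDBACK)
    return [alpha if i <= fb_index else 1.0 for i in range(len(step_kinds))]


def weighted_advantages(
    step_kinds: List[str], advantage: float, alpha: float = 0.5
) -> List[float]:
    """Spread a trajectory-level advantage over steps with attenuated credit."""
    return [w * advantage for w in credit_weights(step_kinds, alpha)]


# A toy repair trajectory: a failed first answer, the feedback call, a revision.
repair = ["search", "answer", "feedback", "search", "answer"]
print(weighted_advantages(repair, advantage=2.0, alpha=0.25))
```

In this sketch the post-feedback revision steps carry the full learning signal, while the failed attempt and the feedback call itself contribute only a fraction, which matches the stated goal of keeping the policy usable when the call is absent.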

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.