CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback
Summary
Credit-Attenuated Privileged Feedback (CAPF) is a training mechanism for LLM search agents trained with reinforcement learning with verifiable rewards (RLVR). On complex problems these agents rarely complete a successful end-to-end rollout, so training yields few positive-reward trajectories to learn from. CAPF addresses this by introducing a training-time "Privileged Feedback call" that uses verifier-side information to identify errors or omissions in the agent's submitted answer. With this feedback, the policy can revise an initial zero-reward attempt into a positive-reward "repair trajectory." Because the call is not available at deployment, CAPF attenuates the credit assigned to the feedback interaction and the actions preceding it, so the trained policy remains deployable without it. In the reported experiments, CAPF raised the Qwen3-4B model's average exact-match score from 44.7% under outcome-only RLVR to 48.5% across seven open-domain QA benchmarks.
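The rollout described above can be sketched as a toy loop. This is a minimal illustration, not the authors' implementation: `rollout_with_feedback`, `privileged_feedback`, `toy_policy`, and the token-overlap hint format are all hypothetical names and choices made for this example, and the policy is a stand-in function rather than an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    kind: str  # "answer", "feedback", or "revision"
    text: str


@dataclass
class Rollout:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0
    repaired: bool = False


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(s.lower().split())


def exact_match(pred: str, gold: str) -> float:
    """Verifiable outcome reward: 1.0 on exact match, else 0.0."""
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def privileged_feedback(pred: str, gold: str) -> str:
    """Hypothetical verifier-side hint: flags unsupported terms and counts
    missing ones without quoting the reference answer."""
    pred_tokens = set(normalize(pred).split())
    gold_tokens = set(normalize(gold).split())
    wrong = sorted(pred_tokens - gold_tokens)
    missing = len(gold_tokens - pred_tokens)
    return f"unsupported terms: {wrong}; missing terms: {missing}"


Policy = Callable[[str, Optional[str]], str]


def rollout_with_feedback(question: str, gold: str, policy: Policy,
                          training: bool = True) -> Rollout:
    r = Rollout()
    answer = policy(question, None)
    r.steps.append(Step("answer", answer))
    r.reward = exact_match(answer, gold)
    # The feedback call exists only at training time, and only after a failure.
    if training and r.reward == 0.0:
        hint = privileged_feedback(answer, gold)
        r.steps.append(Step("feedback", hint))
        revised = policy(question, hint)
        r.steps.append(Step("revision", revised))
        r.reward = exact_match(revised, gold)
        r.repaired = r.reward > 0.0
    return r


def toy_policy(question: str, hint: Optional[str]) -> str:
    """Stand-in policy: wrong on the first try, correct after any hint."""
    return "Lyon" if hint is None else "Paris"
```

With `training=False` the same function skips the feedback call entirely, which mirrors the deployment setting the summary describes.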
Key takeaway
Machine learning engineers whose LLM search agents struggle on complex, low-reward problems should consider integrating Credit-Attenuated Privileged Feedback (CAPF) into the training pipeline. The approach lets the agent learn from verifier-identified errors during training by converting failed attempts into successful repair trajectories. In the reported experiments, Qwen3-4B gained 3.8 percentage points in average exact-match score (44.7% to 48.5%), and credit attenuation is intended to keep the resulting model deployable without the training-time feedback call.
Key insights
CAPF uses verifier feedback during training to improve LLM search agent performance on hard problems.
Principles
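- Outcome-only rewards leave RLVR agents with too few positive-reward trajectories on hard problems.
- Verifier-side information can guide in-rollout revision.
- Attenuate credit for the feedback call and the actions before it, so the policy does not depend on feedback that is absent at deployment.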
Method
During training, CAPF exposes verifier-side error information through a Privileged Feedback call. The policy uses that feedback to revise a zero-reward attempt into a repair trajectory, and the credit assigned to the feedback interaction and the preceding actions is attenuated.
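One way to picture the credit attenuation step is as a per-step weight on the trajectory's advantage. The sketch below is an assumption made for illustration only: the summary does not give the attenuation formula, so the scalar factor `alpha`, the step labels, and the function names `attenuated_weights` and `weighted_advantages` are hypothetical.

```python
from typing import List


def attenuated_weights(step_kinds: List[str], alpha: float = 0.5) -> List[float]:
    """Per-step credit weights for one trajectory.

    Assumption: steps up to and including the "feedback" call are scaled by
    alpha, and steps after it keep full weight. A trajectory with no
    feedback call is left unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if "feedback" not in step_kinds:
        return [1.0] * len(step_kinds)
    cut = step_kinds.index("feedback")
    return [alpha if i <= cut else 1.0 for i in range(len(step_kinds))]


def weighted_advantages(step_kinds: List[str], advantage: float,
                        alpha: float = 0.5) -> List[float]:
    """Scale a trajectory-level advantage by the per-step credit weights."""
    return [w * advantage for w in attenuated_weights(step_kinds, alpha)]
```

For a repair trajectory labeled `["search", "answer", "feedback", "search", "revision"]`, only the last two steps receive the full advantage under this scheme, so the post-feedback revision behavior is reinforced most strongly.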
In practice
- Apply verifier feedback to improve LLM agent training.
- Implement credit attenuation for deployable agents.
- Compare against outcome-only RLVR on open-domain QA; the reported Qwen3-4B result was 44.7% to 48.5% average exact match.
Topics
- LLM Search Agents
- Reinforcement Learning
- Privileged Feedback
- Credit Attenuation
- Qwen3-4B
- Open-domain QA
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.