Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

CVT-RL is a constrained policy-gradient algorithm for long-horizon language agents that targets failure modes such as unsupported evidence chains, belief drift, and reward hacking. It combines a policy-conditioned counterfactual contribution (PCCC) estimator, dense verifiable rewards, and intervention-validity gating. PCCC estimates how much an action contributes to final verified success under specific intervention families (deletion, semantic substitution, evidence substitution, and tool-output perturbation), using a frozen reference policy and a selection-adjusted doubly robust estimator. Evaluated on long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL reached 78.9% average task success, compared with 71.8% for compute-matched non-causal RL and 75.4% for an information-matched counterfactual-process baseline. It also improved evidence F1 from 78.9 to 82.8 and reduced measured hacking from 7.2% to 3.9%; human audits found 4.6% hacking for CVT-RL versus 8.1% for the information-matched baseline.
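
To make the intervention families concrete, here is a minimal Python sketch of how a logged trajectory might be edited before a counterfactual continuation is rolled out. The `Step` fields, the `Intervention` enum, and `apply_intervention` are hypothetical names chosen for illustration; the summary does not describe the paper's actual data structures or how substitutions are generated.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import List, Optional


class Intervention(Enum):
    # The families named in the summary above.
    DELETION = "deletion"
    SEMANTIC_SUBSTITUTION = "semantic_substitution"
    EVIDENCE_SUBSTITUTION = "evidence_substitution"
    TOOL_OUTPUT_PERTURBATION = "tool_output_perturbation"


@dataclass(frozen=True)
class Step:
    # Hypothetical step layout: an agent action plus optional cited
    # evidence and optional tool output.
    action: str
    evidence: Optional[str] = None
    tool_output: Optional[str] = None


def apply_intervention(
    trajectory: List[Step],
    index: int,
    kind: Intervention,
    replacement: str = "",
) -> List[Step]:
    """Return the edited prefix that a continuation policy would start from.

    Steps after `index` are dropped because they would be regenerated
    by the continuation rollout. The input trajectory is not modified.
    """
    prefix = list(trajectory[:index])
    step = trajectory[index]
    if kind is Intervention.DELETION:
        return prefix  # the step is removed entirely
    if kind is Intervention.SEMANTIC_SUBSTITUTION:
        return prefix + [replace(step, action=replacement)]
    if kind is Intervention.EVIDENCE_SUBSTITUTION:
        return prefix + [replace(step, evidence=replacement)]
    if kind is Intervention.TOOL_OUTPUT_PERTURBATION:
        return prefix + [replace(step, tool_output=replacement)]
    raise ValueError(f"unknown intervention: {kind}")
```

In this sketch each family touches a different field of the step, which keeps the effect of changing what the agent did separate from the effect of changing what it cited or observed.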

Key takeaway

AI scientists and ML engineers developing long-horizon language agents should consider integrating policy-conditioned counterfactual credit (PCCC) to improve verifiability and reduce reward hacking. In the reported results, the approach improves task success and evidence quality, even under adaptive attacks, by causally attributing credit to intermediate steps. The compute overhead is substantial, so explore active selection to manage the cost-accuracy trade-off.

Key insights

Policy-conditioned counterfactual credit and validity gating enable verifiable, reliable long-horizon reinforcement learning for language agents.

Principles

Causal credit assignment improves agent reliability.
Separate intervention semantics for distinct contributions.
Validity gating reduces off-support counterfactuals.
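
One way to picture a validity gate is as a filter that discards counterfactual edits a reference policy finds too implausible before their credit is used. The sketch below assumes a simple proxy: the mean token log-probability of the edited step under a reference policy, compared with a threshold. The function names, the threshold value, and the proxy itself are assumptions for illustration, not the paper's gating rule.

```python
from typing import List, Optional, Sequence


def passes_validity_gate(
    token_logprobs: Sequence[float],
    min_mean_logprob: float = -4.0,  # hypothetical threshold
) -> bool:
    """Accept a counterfactual only if the reference policy finds it plausible."""
    if not token_logprobs:
        return False  # nothing to score, so treat as off-support
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return mean_logprob >= min_mean_logprob


def gated_mean_credit(
    credits: Sequence[float],
    logprobs_per_counterfactual: Sequence[Sequence[float]],
    min_mean_logprob: float = -4.0,
) -> Optional[float]:
    """Average credit over counterfactuals that pass the gate.

    Returns None when every counterfactual is rejected, so the caller
    can fall back to a non-counterfactual signal instead of using zero.
    """
    kept: List[float] = [
        credit
        for credit, logprobs in zip(credits, logprobs_per_counterfactual)
        if passes_validity_gate(logprobs, min_mean_logprob)
    ]
    return sum(kept) / len(kept) if kept else None
```

Returning `None` rather than `0.0` when nothing passes is a deliberate choice in this sketch: a rejected counterfactual carries no information, which is different from evidence of zero contribution.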

Method

CVT-RL uses a constrained policy-gradient update with dense verifiable rewards, intervention-validity gating, and a PCCC estimator. PCCC combines controlled interventions (deletion, substitution, perturbation) with a selection-adjusted doubly robust estimator that uses a frozen continuation policy.
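
The doubly robust idea can be sketched with the textbook augmented inverse-propensity form: start from an outcome model's prediction of counterfactual success, then correct it with the observed rollout outcome, reweighted by the probability that the rollout was selected to run. This is a generic illustration under stated assumptions, not the paper's selection-adjusted estimator, whose exact form the summary does not give. `CounterfactualRecord` and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterfactualRecord:
    predicted_success: float  # outcome-model estimate of success under the intervention
    selection_prob: float     # probability that this counterfactual rollout was run
    observed_success: Optional[float] = None  # verified outcome, None if not rolled out


def dr_counterfactual_success(rec: CounterfactualRecord) -> float:
    """Doubly robust pseudo-outcome: model prediction plus a reweighted residual."""
    if not 0.0 < rec.selection_prob <= 1.0:
        raise ValueError("selection_prob must be in (0, 1]")
    if rec.observed_success is None:
        # Not selected: only the model prediction is available.
        return rec.predicted_success
    residual = rec.observed_success - rec.predicted_success
    return rec.predicted_success + residual / rec.selection_prob


def contribution(factual_success: float, rec: CounterfactualRecord) -> float:
    """Credit for a step: factual success minus estimated success without it."""
    return factual_success - dr_counterfactual_success(rec)
```

A single pseudo-outcome can fall outside [0, 1] because of the reweighting; it is its average over many records that is meant to be accurate when either the outcome model or the selection probabilities are correct.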

In practice

Apply PCCC with specific intervention families.
Use validity gates to filter OOD counterfactuals.
Implement constrained trust-region updates for stability.
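
For the trust-region item above, a common pattern is to shrink a proposed update until the KL divergence between the old and new policy stays under a budget. The sketch below does this for a single categorical action distribution with backtracking. It is a simplified stand-in, not CVT-RL's constrained update: `max_kl`, `shrink`, and `max_backtracks` are hypothetical parameters, and a real agent policy would compute the KL over sampled token sequences.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two categorical distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def trust_region_step(
    logits: List[float],
    direction: List[float],
    max_kl: float = 0.01,
    shrink: float = 0.5,
    max_backtracks: int = 20,
) -> List[float]:
    """Apply the largest backtracked step whose KL to the old policy is within budget."""
    old_probs = softmax(logits)
    scale = 1.0
    for _ in range(max_backtracks):
        candidate = [l + scale * d for l, d in zip(logits, direction)]
        if kl_divergence(old_probs, softmax(candidate)) <= max_kl:
            return candidate
        scale *= shrink  # step too large: halve it and retry
    return list(logits)  # no acceptable step found, keep the old policy
```

Falling back to the old logits when no step fits the budget trades progress for stability, which matches the motivation given for the trust-region constraint.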

Topics

Reinforcement Learning
Language Agents
Causal Inference
Verifiable Rewards
Reward Hacking
Constrained Optimization
Large Language Models

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.