Reinforcement Learning from Rich Feedback with Distributional DAgger
Summary
The paper "Reinforcement Learning from Rich Feedback with Distributional DAgger" (arXiv:2606.05152) introduces DistIL, an approach to reinforcement learning that uses rich feedback rather than only the single-bit rewards typical of reinforcement learning with verifiable rewards (RLVR). Current RLVR methods often ignore valuable information such as execution traces, tool outputs, and expert corrections. DistIL instead uses a distributional variant of the DAgger imitation learning algorithm trained with a simple forward cross-entropy objective. This objective supports credit assignment by propagating future expert-student disagreement back to earlier decisions. According to the paper, DistIL guarantees monotonic policy improvement and comes with regret bounds, whereas the objectives used in prior RL with self-distillation (e.g., reverse KL or Jensen-Shannon) may increase the probability of worse actions. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and complex mathematical problem solving.
Key takeaway
Machine learning engineers developing reasoning models should consider DistIL's approach when rich feedback sources such as execution traces or expert corrections are available. Existing RLVR or self-distillation pipelines may be suboptimal in that setting: per the paper, DistIL guarantees monotonic policy improvement and empirically outperforms these baselines in domains such as coding and scientific reasoning. Adopting it may make training more robust and improve metrics such as Pass@N.
Key insights
DistIL uses a forward cross-entropy objective with rich feedback to guarantee monotonic policy improvement in reinforcement learning.
Principles
- Rich feedback improves RL beyond single-bit rewards.
- Forward cross-entropy guarantees monotonic policy improvement.
- Prior reverse KL objectives lack monotonic improvement guarantees.
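The contrast between the forward cross-entropy objective and a reverse KL objective can be made concrete with a small numerical sketch. The two functions below are the standard textbook definitions, not code from the paper, and the `expert` and `student` distributions are made-up values chosen only for illustration. The point to notice is which distribution does the weighting: forward cross-entropy averages under the expert's probabilities, while reverse KL averages under the student's own.

```python
import math

def forward_cross_entropy(expert, student):
    """H(expert, student) = -sum_a expert[a] * log(student[a]).

    The expectation is taken under the expert distribution.
    Assumes student[a] > 0 wherever expert[a] > 0.
    """
    return -sum(p * math.log(q) for p, q in zip(expert, student) if p > 0)

def reverse_kl(expert, student):
    """KL(student || expert) = sum_a student[a] * log(student[a] / expert[a]).

    The expectation is taken under the student distribution.
    Assumes expert[a] > 0 wherever student[a] > 0.
    """
    return sum(q * math.log(q / p) for p, q in zip(expert, student) if q > 0)

# Illustrative (hypothetical) action distributions at a single state.
expert = [0.7, 0.2, 0.1]
student = [0.1, 0.1, 0.8]

fce = forward_cross_entropy(expert, student)
rkl = reverse_kl(expert, student)
print(f"forward cross-entropy: {fce:.4f}")
print(f"reverse KL:            {rkl:.4f}")
```

Both quantities are minimized when the student matches the expert, but they penalize mismatches differently, which is one way to read the paper's claim that the choice of objective affects the improvement guarantee.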
Method
DistIL is a distributional variant of DAgger trained with a forward cross-entropy objective. The learner queries the expert's distribution on the states it visits; the resulting sequence-level gradients provide rich credit assignment, and the paper reports a guarantee of monotonic policy improvement.
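A toy, tabular sketch of the general DAgger-style pattern described here is given below. It is not the paper's implementation: the chain environment, the `expert_distribution` function, and all hyperparameters are hypothetical, and the update is a simple per-state gradient step rather than the sequence-level gradient the paper describes. It only illustrates the loop of rolling out the student's own policy, querying an expert distribution on the visited states, and minimizing forward cross-entropy there.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expert_distribution(state):
    # Hypothetical expert: prefers action 1 (move right) in every state.
    return [0.1, 0.9]

def rollout(logits, horizon, rng):
    """Sample a trajectory from the student's own policy; return visited states."""
    state, visited = 0, []
    for _ in range(horizon):
        visited.append(state)
        probs = softmax(logits[state])
        action = 1 if rng.random() < probs[1] else 0
        step = 1 if action == 1 else -1
        # Clamp to the ends of the chain.
        state = max(0, min(len(logits) - 1, state + step))
    return visited

def train(num_states=5, iterations=500, horizon=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    # One pair of action logits per state (tabular student policy).
    logits = [[0.0, 0.0] for _ in range(num_states)]
    for _ in range(iterations):
        for state in rollout(logits, horizon, rng):
            student = softmax(logits[state])
            target = expert_distribution(state)
            # For a softmax policy, the gradient of the forward cross-entropy
            # H(target, student) with respect to the logits is student - target.
            for a in range(2):
                logits[state][a] -= lr * (student[a] - target[a])
    return logits

logits = train()
print([round(p, 3) for p in softmax(logits[0])])
```

Because the training states come from the student's own rollouts rather than from expert demonstrations, the student is corrected on the states it actually reaches, which is the core idea DAgger contributes.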
In practice
- Improves Pass@N scores.
- Effective for scientific reasoning tasks.
- Applicable to coding and complex math problems.
Topics
- Reinforcement Learning
- Imitation Learning
- Rich Feedback
- Distributional DAgger
- Policy Improvement
- Scientific Reasoning
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.