Reinforcement Learning from Rich Feedback with Distributional DAgger

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, short

Summary

The paper "Reinforcement Learning from Rich Feedback with Distributional DAgger" (arXiv:2606.05152) introduces DistIL, a reinforcement learning method that uses feedback richer than the single-bit reward typical of reinforcement learning with verifiable rewards (RLVR). Current RLVR methods often ignore valuable information such as execution traces, tool outputs, and expert corrections. DistIL is a distributional variant of the DAgger imitation learning algorithm trained with a simple forward cross-entropy objective, which assigns credit by propagating future expert-student disagreement back to earlier decisions. DistIL guarantees monotonic policy improvement and comes with regret bounds; by contrast, prior RL methods with self-distillation objectives (e.g., reverse KL or Jensen-Shannon) may increase the probability of worse actions. Empirically, DistIL outperforms RLVR and self-distillation baselines on scientific reasoning, coding, and complex mathematical problem solving.
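
For orientation, the objectives named above can be written out per state using their standard textbook definitions. The notation is an assumption made here, not the paper's own: p_E(a|s) is the expert distribution over actions at state s, \pi_\theta(a|s) is the student policy, and m is their equal mixture. The forward cross-entropy averages over the expert's distribution, while reverse KL averages over the student's.

```latex
% Forward cross-entropy (expectation under the expert):
H\bigl(p_E, \pi_\theta\bigr)(s) = -\sum_a p_E(a \mid s)\, \log \pi_\theta(a \mid s)

% Reverse KL (expectation under the student):
\mathrm{KL}\bigl(\pi_\theta \,\|\, p_E\bigr)(s) = \sum_a \pi_\theta(a \mid s)\, \log \frac{\pi_\theta(a \mid s)}{p_E(a \mid s)}

% Jensen-Shannon, with m = \tfrac{1}{2}\bigl(p_E + \pi_\theta\bigr):
\mathrm{JS}\bigl(p_E, \pi_\theta\bigr)(s) = \tfrac{1}{2}\,\mathrm{KL}\bigl(p_E \,\|\, m\bigr)(s) + \tfrac{1}{2}\,\mathrm{KL}\bigl(\pi_\theta \,\|\, m\bigr)(s)
```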

Key takeaway

Machine learning engineers developing reasoning models should consider DistIL's approach when rich feedback such as execution traces or expert corrections is available. DistIL guarantees monotonic policy improvement and empirically outperforms RLVR and self-distillation baselines in domains like coding and scientific reasoning, so pipelines built on those methods may be suboptimal. Adopting it may make training more robust and improve metrics like Pass@N.

Key insights

DistIL uses a forward cross-entropy objective with rich feedback to guarantee monotonic policy improvement in reinforcement learning.

Method

DistIL is a distributional variant of DAgger trained with a forward cross-entropy objective. The learner queries an expert distribution on the states it visits, which yields sequence-level gradients for rich credit assignment and supports the guarantee of monotonic policy improvement.
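
As a rough illustration of a forward cross-entropy objective on visited states, the sketch below computes the loss between an expert distribution and a softmax student policy, then takes one gradient step on the student's logits. It is a minimal toy under stated assumptions, not the paper's implementation: the function names, the tabular per-state logits, the made-up expert probabilities, and the learning rate are all hypothetical, and the DAgger-style rollout that would collect the visited states is omitted.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def forward_cross_entropy(expert_probs, student_logits):
    """Mean over visited states of -sum_a p_expert(a|s) * log pi_student(a|s)."""
    total = 0.0
    for p_row, z_row in zip(expert_probs, student_logits):
        logp = log_softmax(z_row)
        total -= sum(p * lp for p, lp in zip(p_row, logp))
    return total / len(expert_probs)

def grad_wrt_logits(expert_probs, student_logits):
    """Gradient of the mean loss w.r.t. logits: (softmax(z) - p_expert) / num_states."""
    n = len(expert_probs)
    grads = []
    for p_row, z_row in zip(expert_probs, student_logits):
        probs = [math.exp(lp) for lp in log_softmax(z_row)]
        grads.append([(q - p) / n for q, p in zip(probs, p_row)])
    return grads

# Hypothetical toy data: 2 visited states, 3 actions each (values are made up).
expert_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
student_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform student policy

loss_before = forward_cross_entropy(expert_probs, student_logits)

# One plain gradient-descent step on the logits (learning rate is an assumption).
lr = 1.0
grads = grad_wrt_logits(expert_probs, student_logits)
student_logits = [
    [z - lr * g for z, g in zip(z_row, g_row)]
    for z_row, g_row in zip(student_logits, grads)
]

loss_after = forward_cross_entropy(expert_probs, student_logits)
print(loss_before, loss_after)
```

Because the expectation is taken under the expert distribution, the gradient with respect to the logits is simply the student's probabilities minus the expert's, so each step moves probability mass toward the actions the expert favors.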

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.