Learning from Language Feedback via Variational Policy Distillation
Summary
Variational Policy Distillation (VPD) is a framework for reinforcement learning from verifiable rewards (RLVR) that addresses sparse outcome signals in complex reasoning tasks. Previous on-policy self-distillation methods use a fixed teacher to interpret language feedback. VPD instead formalizes learning as a Variational Expectation-Maximization (EM) problem in which the teacher and student policies co-evolve. In the E-step, the teacher policy is refined with an adaptive trust-region update based on trajectory outcomes, which converts textual feedback into a progressively improved target token distribution. In the M-step, the student policy internalizes this dense distributional guidance on its own on-policy rollouts. Because the teacher keeps getting better at extracting actionable signals from textual critique, VPD consistently outperforms standard RLVR and existing self-distillation baselines on scientific reasoning and code generation tasks.
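For readers who want the shape of the EM formulation, the block below writes out the generic variational lower bound that this kind of framing usually rests on. It is the textbook RL-as-inference bound, not necessarily the exact objective in the paper, and all notation is assumed for illustration: x is the prompt, y a response, f the language feedback, O = 1 the event that the verifier accepts the response, \pi_\theta the student, and q the feedback-conditioned teacher.

```latex
% Marginal success probability under the student (assumed notation):
%   p_\theta(O = 1 \mid x) = \sum_y \pi_\theta(y \mid x)\, p(O = 1 \mid x, y)
% Jensen's inequality gives, for any teacher distribution q:
\log p_\theta(O = 1 \mid x)
  \;\ge\;
  \mathbb{E}_{y \sim q(\cdot \mid x, f)}\big[\log p(O = 1 \mid x, y)\big]
  \;-\;
  \mathrm{KL}\big(q(\cdot \mid x, f)\,\big\|\,\pi_\theta(\cdot \mid x)\big)
% E-step: raise the bound over q (refine the teacher).
% M-step: raise the bound over \theta (pull the student toward q).
```

Read this way, a fixed teacher corresponds to skipping the E-step, which is the limitation the summary attributes to earlier self-distillation methods.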
Key takeaway
For research scientists developing reinforcement learning agents for complex reasoning or code generation, VPD offers a way to work around sparse reward signals. By dynamically refining a teacher policy that interprets language feedback, VPD supplies dense, token-level supervision in place of a single outcome reward. Consider integrating VPD's co-evolutionary approach to push your models' performance beyond traditional RLVR or static self-distillation techniques, especially in cold-start regimes or rigid mathematical reasoning tasks.
Key insights
VPD co-evolves teacher and student policies to overcome sparse rewards in RL via dynamic language feedback interpretation.
Principles
- Co-evolving policies improve learning from feedback.
- Adaptive trust-region updates refine teacher policies.
- Dense distributional guidance aids student internalization.
Method
VPD uses a Variational EM approach: the E-step refines the teacher via adaptive trust-region updates on trajectory outcomes, translating textual feedback into a target token distribution; the M-step distills this guidance to the student on its on-policy rollouts.
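The alternation described above can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the paper's implementation: the "policy" is a single categorical distribution over four tokens, the teacher is a separate distribution instead of a feedback-conditioned language model, the E-step is an exponentiated-reward tilt whose temperature is doubled until a KL budget is met (one possible stand-in for an adaptive trust region), and the M-step uses an exact forward-KL gradient instead of sampled on-policy rollouts. All function names and hyperparameters are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def e_step(teacher, rewards, max_kl=0.05, beta=0.1):
    """Toy E-step (assumed form): tilt the teacher toward rewarded tokens,
    doubling the temperature until the move fits the KL budget."""
    while True:
        tilted = [p * math.exp(r / beta) for p, r in zip(teacher, rewards)]
        z = sum(tilted)
        q = [t / z for t in tilted]
        if kl(q, teacher) <= max_kl:
            return q
        beta *= 2.0  # larger temperature means a weaker tilt, so a smaller step

def m_step(student_logits, target, lr=1.0):
    """Toy M-step: one gradient step on KL(target || student).
    The gradient with respect to the logits is softmax(logits) - target."""
    probs = softmax(student_logits)
    return [s - lr * (p - t) for s, p, t in zip(student_logits, probs, target)]

def run_vpd_toy(rewards, steps=30):
    """Alternate E- and M-steps, starting from uniform teacher and student."""
    student_logits = [0.0] * len(rewards)
    teacher = softmax([0.0] * len(rewards))
    for _ in range(steps):
        teacher = e_step(teacher, rewards)                # refine the teacher
        student_logits = m_step(student_logits, teacher)  # distill into the student
    return softmax(student_logits), teacher

if __name__ == "__main__":
    student, teacher = run_vpd_toy([0.0, 0.0, 1.0, 0.0])
    print("student:", [round(p, 3) for p in student])
    print("teacher:", [round(p, 3) for p in teacher])
```

In this toy setup the teacher moves toward the rewarded token in bounded steps and the student trails it, which mirrors the co-evolution idea. A realistic version would replace the categorical distributions with a language model queried with and without feedback in context.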
In practice
- Apply VPD for scientific reasoning tasks.
- Use VPD for improved code generation.
- Consider VPD for RL settings where outcome rewards are sparse but textual feedback is available.
Topics
- Variational Policy Distillation
- Reinforcement Learning from Language Feedback
- Expectation-Maximization
- Scientific Reasoning
- Code Generation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.