Learning from Language Feedback via Variational Policy Distillation

2026-05-14 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Variational Policy Distillation (VPD) is a framework for improving reinforcement learning from verifiable rewards (RLVR) when outcome signals are sparse, as they are in complex reasoning tasks. Earlier on-policy self-distillation methods use a fixed teacher to interpret language feedback; VPD instead formalizes learning as a Variational Expectation-Maximization (EM) problem in which the teacher and student policies co-evolve. In the E-step, the teacher policy is refined with an adaptive trust-region update based on trajectory outcomes, which converts textual feedback into an improved target token distribution. In the M-step, the student policy internalizes this dense distributional guidance on its own on-policy rollouts. Because the teacher keeps getting better at extracting actionable signals from textual critique, VPD is reported to consistently outperform standard RLVR and existing self-distillation baselines on scientific reasoning and code generation tasks.

Key takeaway

For research scientists developing reinforcement learning agents for complex reasoning or code generation, VPD offers a way to overcome sparse reward signals. By refining a teacher policy as it interprets language feedback, VPD supplies dense, token-level supervision where the outcome reward alone is sparse. Consider its co-evolutionary approach where traditional RLVR or static self-distillation falls short, especially in cold-start regimes or rigid mathematical reasoning tasks.

Key insights

VPD co-evolves teacher and student policies so that the teacher's interpretation of language feedback keeps improving, which offsets sparse rewards in RL.

Principles

Co-evolving policies improve learning from feedback.
Adaptive trust-region updates refine teacher policies.
Dense distributional guidance aids student internalization.
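
The summary does not say how VPD adapts its trust region, so the following is only a sketch of one widely used rule, the adaptive KL penalty from PPO, shown to make the idea concrete. The function names, the 1.5 and 2.0 constants, the KL direction, and the example distributions are assumptions for illustration, not details taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def adapt_kl_coefficient(beta, observed_kl, target_kl):
    """PPO-style rule (an assumption here, not VPD's published rule).

    Doubles the penalty when the update moved too far from the old policy,
    halves it when the update was overly cautious, otherwise leaves it alone.
    """
    if observed_kl > 1.5 * target_kl:
        return beta * 2.0
    if observed_kl < target_kl / 1.5:
        return beta / 2.0
    return beta

# Hypothetical teacher distributions over four tokens, before and after an update.
old_teacher = [0.25, 0.25, 0.25, 0.25]
new_teacher = [0.70, 0.10, 0.10, 0.10]

observed = kl_divergence(new_teacher, old_teacher)
beta = adapt_kl_coefficient(beta=1.0, observed_kl=observed, target_kl=0.05)
```

Here the teacher moved well past the target divergence, so the penalty coefficient is raised and the next update is held closer to the previous teacher.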

Method

VPD uses a Variational EM approach: the E-step refines the teacher via adaptive trust-region updates on trajectory outcomes, translating textual feedback into a target token distribution; the M-step distills this guidance to the student on its on-policy rollouts.
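
The alternation described above can be sketched on a toy problem with a single token choice. This is a minimal illustration under strong assumptions, not the authors' implementation: the teacher is a plain table of probabilities rather than a feedback-conditioned language model, the textual critique is not modeled at all, the trust-region strength `beta` is held fixed instead of adapted, and the E-step uses the standard closed-form solution of a KL-regularized update as a stand-in. All names (`e_step`, `m_step`, `run_toy_vpd`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def e_step(teacher, student, rewards, beta):
    """KL-regularized teacher update: new teacher proportional to
    old teacher * exp(advantage / beta). Larger beta means a smaller step."""
    # Baseline is the student's expected reward, i.e. the on-policy outcome.
    baseline = sum(p * r for p, r in zip(student, rewards))
    weights = [q * math.exp((r - baseline) / beta) for q, r in zip(teacher, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def m_step(student_logits, teacher, lr):
    """One gradient step on KL(teacher || student) with respect to the
    student logits; that gradient is student_probs - teacher_probs."""
    student = softmax(student_logits)
    return [z - lr * (s - t) for z, s, t in zip(student_logits, student, teacher)]

def run_toy_vpd(steps=200, beta=1.0, lr=0.5):
    rewards = [0.0, 0.0, 1.0, 0.0]       # verifiable outcome: only token 2 is correct
    student_logits = [0.0, 0.0, 0.0, 0.0]
    teacher = softmax(student_logits)     # teacher starts as a copy of the student
    for _ in range(steps):
        student = softmax(student_logits)
        teacher = e_step(teacher, student, rewards, beta)     # refine the teacher
        student_logits = m_step(student_logits, teacher, lr)  # distill into the student
    return softmax(student_logits), teacher

student_probs, teacher_probs = run_toy_vpd()
```

Because the task has only one token position, the on-policy expectation is computed exactly instead of being estimated from sampled rollouts. The point of the sketch is the structure: the teacher's full distribution, not a scalar reward, is what the student is trained against.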

In practice

Apply VPD for scientific reasoning tasks.
Use VPD for improved code generation.
Consider VPD for complex RL environments.

Topics

Variational Policy Distillation
Reinforcement Learning from Language Feedback
Expectation-Maximization
Scientific Reasoning
Code Generation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.