Blockwise Policy-Drift Gating for On-Policy Distillation
Summary
Blockwise Policy-Drift Gating is a lightweight, student-only controller that tracks drift between the old (behavior) student and the current student to make On-Policy Distillation (OPD) more robust on long-horizon reasoning tasks. It addresses the fragility of sampled-token OPD by computing log-probability shifts between the behavior student and the current student along the sampled token path. These shifts are aggregated over fixed blocks, such as 64-token spans, and used as detached, mean-normalized gates that reweight per-position OPD losses without altering teacher targets or the rollout policy. On a six-variant Qwen3 math reasoning benchmark with a uniform 200-step training budget, fixed 64-token block gating raised sampled-token OPD mean pass@8 from 0.4978 to 0.5160 across AIME24, AIME25, MATH500, and AMC23. The technique also yielded the best four-benchmark mean pass@8 among trained students on Teacher-TopK/LSM, a result consistent with improved solve-rate robustness.
Key takeaway
For Machine Learning Engineers optimizing On-Policy Distillation for long-horizon reasoning tasks, blockwise policy-drift gating is worth considering. With 64-token blocks, it raised mean pass@8 from 0.4978 to 0.5160 on math benchmarks, an absolute gain of about 0.018 (roughly 3.7% relative). Because the drift controller is lightweight and student-only, it can be added to sampled-token OPD without altering teacher targets or the rollout policy.
Key insights
Blockwise policy-drift gating improves on-policy distillation robustness by controlling student policy divergence during rollout reuse.
Principles
- Local policy drift is a practical control signal.
- Block-level gating enhances solve-rate robustness.
- Rollout reuse benefits from drift control.
Method
Compute log-probability shifts between the behavior and current student policies on the sampled token path, aggregate the shifts over fixed blocks (for example, 64 tokens), then use the resulting detached, mean-normalized gates to reweight per-position On-Policy Distillation losses.
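The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how a block's drift is turned into a gate, so the use of the mean absolute shift, the `exp(-beta * drift)` mapping, and the names `blockwise_drift_gates`, `gated_opd_loss`, and `beta` are all hypothetical. In an autograd framework the gates would additionally be wrapped in a stop-gradient so they act as constants; NumPy arrays are already constants.

```python
import numpy as np

def blockwise_drift_gates(logp_behavior, logp_current, block_size=64, beta=1.0):
    """Per-token gates from blockwise behavior-vs-current student drift (sketch)."""
    logp_behavior = np.asarray(logp_behavior, dtype=np.float64)
    logp_current = np.asarray(logp_current, dtype=np.float64)

    # Log-probability shift of the student on the sampled token path.
    shift = logp_current - logp_behavior

    # Assign each position to a fixed-size block; the last block may be shorter.
    n = shift.shape[0]
    n_blocks = -(-n // block_size)  # ceiling division
    block_id = np.arange(n) // block_size

    # ASSUMPTION: block drift is the mean absolute shift within the block.
    block_sum = np.bincount(block_id, weights=np.abs(shift), minlength=n_blocks)
    block_cnt = np.bincount(block_id, minlength=n_blocks)
    block_drift = block_sum / block_cnt

    # ASSUMPTION: larger drift gives a smaller gate via exp(-beta * drift).
    raw_gate = np.exp(-beta * block_drift)

    # Broadcast block gates back to positions, then mean-normalize so the
    # average gate over the sequence is 1.
    gates = raw_gate[block_id]
    return gates / gates.mean()

def gated_opd_loss(position_losses, gates):
    """Reweight per-position OPD losses with the (constant) gates."""
    position_losses = np.asarray(position_losses, dtype=np.float64)
    return float(np.mean(gates * position_losses))
```

Mean normalization keeps the overall loss scale unchanged: the gates only redistribute weight between blocks, and teacher targets and the rollout policy are not touched. When the behavior and current student agree everywhere, every gate equals 1 and the loss reduces to plain sampled-token OPD.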
In practice
- Apply 64-token block gating for OPD.
- Use drift control for long-horizon tasks.
- Improve Qwen3 math reasoning solve-rates.
Topics
- On-Policy Distillation
- Policy Drift Gating
- Long-Horizon Reasoning
- Qwen3 Benchmark
- Math Reasoning
- Model Robustness
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.