Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
Summary
On-policy distillation (OPD) is a training method where student models learn from stronger teachers using data generated by the student's own distribution. A critical failure mode identified in OPD is "length inflation," where on-policy rollouts abruptly increase in length during training, leading to truncated trajectories dominating the training data. This issue correlates with repetition saturation and results in biased gradient signals, causing significant training instability and a sharp decline in validation performance. Researchers attribute this to the interplay between student-induced data collection and the distillation objective, which inadvertently promotes long, repetitive rollouts. To counter this, StableOPD was developed, integrating a reference-based divergence constraint and rollout mixture distillation. This framework effectively mitigates repetition-induced length inflation, stabilizes OPD training, and has demonstrated an average performance improvement of 7.2% across several math reasoning datasets.
Key takeaway
AI engineers who train or deploy large language models with on-policy distillation should consider adopting StableOPD's techniques. A reference-based divergence constraint combined with rollout mixture distillation can prevent the training instability and performance degradation caused by length inflation and repetition saturation; the reported average gain is 7.2% across several math reasoning datasets.
Key insights
On-policy distillation (OPD) suffers from length inflation and instability, which StableOPD addresses with divergence constraints and mixture distillation.
Principles
- Student-induced data can bias distillation objectives.
- Repetition saturation correlates with length inflation.
- Stabilizing rollouts improves model performance.
Method
StableOPD combines a reference-based divergence constraint with rollout mixture distillation to prevent truncation collapse and stabilize on-policy distillation training.
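The summary does not give StableOPD's exact formulation, so the following is only a minimal sketch of how the two ingredients could fit together. It assumes the distillation term is a per-token KL from student to teacher, that the constraint is a KL from the student to a frozen reference model weighted by a hypothetical coefficient `beta`, and that "rollout mixture" means building each batch from student rollouts plus rollouts from another source. The names `constrained_distill_loss`, `mix_rollouts`, `beta`, and `student_fraction` are illustrative and do not come from the original article.

```python
import math
import random
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def constrained_distill_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    reference: Sequence[float],
    beta: float = 0.1,  # assumed constraint weight, not a value from the article
) -> float:
    """Per-token loss: distillation term plus a penalty for drifting from the reference."""
    distill = kl_divergence(student, teacher)
    drift = kl_divergence(student, reference)
    return distill + beta * drift


def mix_rollouts(
    student_rollouts: List[str],
    other_rollouts: List[str],
    student_fraction: float,
    batch_size: int,
    rng: random.Random,
) -> List[str]:
    """Build a batch in which only part of the data is student-generated (assumed mixing scheme)."""
    n_student = round(student_fraction * batch_size)
    batch = rng.sample(student_rollouts, n_student)
    batch += rng.sample(other_rollouts, batch_size - n_student)
    rng.shuffle(batch)
    return batch
```

Under these assumptions, the reference term limits how far the student can drift toward degenerate, repetitive outputs, and the mixed batch keeps truncated student rollouts from dominating the training data.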
In practice
- Add a reference-based divergence constraint to the distillation objective.
- Monitor rollout length, truncation rate, and repetition during training so that length inflation is caught early.
- Apply rollout mixture distillation to stabilize training.
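The monitoring step above can be sketched as a small helper that runs on each batch of rollouts. This is an illustrative example, not the article's metric: it assumes rollouts are lists of token IDs, treats a rollout as truncated when it reaches a hypothetical `max_len` generation cap, and measures repetition as the fraction of n-grams that duplicate an earlier n-gram in the same rollout.

```python
from collections import Counter
from typing import Dict, List, Sequence


def repetition_rate(tokens: Sequence[int], n: int = 4) -> float:
    """Fraction of n-grams in a rollout that repeat an earlier n-gram."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    repeated = sum(c - 1 for c in counts.values())
    return repeated / total


def rollout_stats(rollouts: List[Sequence[int]], max_len: int, n: int = 4) -> Dict[str, float]:
    """Batch-level signals worth logging at every training step."""
    if not rollouts:
        raise ValueError("rollouts must be non-empty")
    lengths = [len(r) for r in rollouts]
    return {
        "mean_length": sum(lengths) / len(lengths),
        # Rollouts that hit the generation cap are counted as truncated.
        "truncation_rate": sum(length >= max_len for length in lengths) / len(lengths),
        "mean_repetition": sum(repetition_rate(r, n) for r in rollouts) / len(rollouts),
    }
```

A sudden joint rise in `truncation_rate` and `mean_repetition` would match the failure pattern the summary describes, and could serve as a trigger for stopping a run or alerting.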
Topics
- On-policy Distillation
- Large Language Models
- Length Inflation
- StableOPD
- Rollout Mixture Distillation
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.