Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

2026-04-09 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

On-policy distillation (OPD) is a training method in which a student model learns from a stronger teacher on data sampled from the student's own distribution. A critical failure mode in OPD is "length inflation": on-policy rollouts abruptly grow longer during training until truncated trajectories dominate the training data. This coincides with repetition saturation and produces biased gradient signals, causing significant training instability and a sharp decline in validation performance. The researchers attribute the problem to the interplay between student-induced data collection and the distillation objective, which inadvertently favors long, repetitive rollouts. To counter this, they developed StableOPD, which combines a reference-based divergence constraint with rollout mixture distillation. The framework mitigates repetition-induced length inflation, stabilizes OPD training, and shows an average performance improvement of 7.2% across several math reasoning datasets.

Key takeaway

AI engineers training large language models with on-policy distillation should consider adopting StableOPD's techniques. A reference-based divergence constraint and rollout mixture distillation can prevent the training instability and performance degradation caused by length inflation and repetition saturation; the reported average gain is 7.2% across several math reasoning datasets.

Key insights

On-policy distillation (OPD) suffers from length inflation and instability, which StableOPD addresses with divergence constraints and mixture distillation.

Principles

Student-induced data collection can bias the gradient signal of the distillation objective.
Repetition saturation correlates with length inflation.
Stabilizing rollouts improves model performance.

Method

StableOPD combines a reference-based divergence constraint with rollout mixture distillation to prevent truncation collapse and stabilize on-policy distillation training.
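
The summary does not spell out how either component is defined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the distillation term is a per-token KL divergence from student to teacher, that the "reference-based divergence constraint" is a KL penalty toward a frozen reference model weighted by a hypothetical coefficient `beta`, and that "rollout mixture" means drawing some rollouts from a non-student source with a hypothetical probability `mix_prob`. All function and parameter names are illustrative.

```python
import math
import random


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def constrained_distill_loss(student, teacher, reference, beta=0.1):
    """Mean per-token loss over one rollout.

    Each argument is a list of per-token distributions over the vocabulary.
    ASSUMPTION: the loss is KL(student || teacher) plus beta times
    KL(student || reference); the source does not give the exact form.
    """
    if not student:
        return 0.0
    total = 0.0
    for s, t, r in zip(student, teacher, reference):
        total += kl(s, t) + beta * kl(s, r)
    return total / len(student)


def pick_rollout_source(mix_prob, rng):
    """ASSUMPTION: with probability mix_prob, take the rollout from a
    non-student source (here labeled "reference"); otherwise from the student."""
    return "reference" if rng.random() < mix_prob else "student"


# Toy usage: two token positions, vocabulary of size three.
student = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
reference = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3]]
loss = constrained_distill_loss(student, teacher, reference, beta=0.1)
source = pick_rollout_source(mix_prob=0.25, rng=random.Random(0))
```

In this reading, the penalty term grows as the student drifts from the reference, which is one way a constraint could discourage the student from collapsing into long, repetitive outputs.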

In practice

Implement divergence constraints in distillation.
Monitor rollout length and repetition during training.
Apply rollout mixture distillation for stability.
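
The monitoring step above can be sketched with two simple statistics per batch of rollouts: the share of rollouts that hit the generation cap, and the share of repeated n-grams within each rollout. This is an illustrative sketch, not the metric used in the source; the function names, the n-gram size of 4, and the definition of repetition are all assumptions.

```python
from collections import Counter


def truncation_rate(lengths, max_len):
    """Fraction of rollouts whose length reached the generation cap."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= max_len) / len(lengths)


def repetition_ratio(tokens, n=4):
    """Fraction of n-grams in one rollout that duplicate an earlier n-gram."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    unique = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return 1.0 - len(unique) / total


def rollout_health(rollouts, max_len, n=4):
    """Summarize one batch of token-id rollouts for logging."""
    lengths = [len(r) for r in rollouts]
    count = len(rollouts)
    return {
        "mean_length": sum(lengths) / count if count else 0.0,
        "truncation_rate": truncation_rate(lengths, max_len),
        "mean_repetition": (
            sum(repetition_ratio(r, n) for r in rollouts) / count
            if count else 0.0
        ),
    }


# Toy batch: one healthy rollout and one that loops until the cap of 8 tokens.
batch = [[1, 2, 3, 4, 5], [7, 8, 7, 8, 7, 8, 7, 8]]
stats = rollout_health(batch, max_len=8, n=2)
```

Logging these values every training step makes an abrupt jump in length or truncation visible early, before validation performance drops.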

Topics

On-policy Distillation
Large Language Models
Length Inflation
StableOPD
Rollout Mixture Distillation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.