Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback
Summary
A new paper, "Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback," published on 2026-06-29, investigates why online imitation learning (IL) methods such as on-policy distillation (OPD) frequently outperform offline methods such as supervised fine-tuning (SFT) in language model training. The authors introduce a noisy expert model in which the learner has access to an imperfect expert policy but aims to match the reward of a clean expert. They show that offline learning from noisy trajectories is fundamentally difficult, requiring exponential sample complexity to match the clean expert. In contrast, online interaction with the noisy expert, using a novel OPD variant, achieves sample complexity with polynomial dependence on the horizon. The paper also identifies a specific condition on the expert noise distribution under which the sample complexity becomes horizon-free, though with reduced statistical efficiency. The analysis offers a theoretical basis for OPD's empirical advantage over SFT when training language models with imperfect teachers.
Key takeaway
For machine learning engineers training language models with imperfect or noisy expert data: prefer on-policy distillation (OPD) over supervised fine-tuning (SFT). The choice of online methods is theoretically justified, since the paper shows that OPD has better sample complexity than offline learning when the expert is noisy. Consider implementing the OPD variants or alternative loss functions from this analysis to improve model performance and efficiency when expert feedback is less than ideal.
Key insights
Online on-policy distillation (OPD) is theoretically superior to offline methods for Imitation Learning with noisy expert feedback, especially for language models.
Principles
- Offline learning from noisy trajectories demands exponential sample complexity.
- Online interaction via OPD achieves sample complexity that is polynomial in the horizon.
- A specific condition on the expert noise distribution enables horizon-free sample complexity, at the cost of reduced statistical efficiency.
Method
The authors propose a novel variant of on-policy distillation (OPD) for online interaction with noisy experts and prove that, in general, its sample complexity depends polynomially on the horizon.
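To make the offline versus online distinction concrete, here is a minimal toy sketch in Python. It is not the paper's OPD variant, whose details are not given in this summary. Everything in it is an assumption made for illustration: a tabular "language model" whose state is just the previous token, a teacher given as a fixed probability table, plain cross-entropy updates, and the hypothetical helper names softmax, rollout, and train. The only point it shows is where the training states come from: teacher rollouts in the SFT-style branch, student rollouts in the OPD-style branch.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def rollout(policy_probs, horizon, rng):
    """Sample one trajectory of (state, token) pairs.

    The state is the previous token shifted by one; state 0 is the start.
    policy_probs has shape (num_tokens + 1, num_tokens).
    """
    state, pairs = 0, []
    for _ in range(horizon):
        token = int(rng.choice(policy_probs.shape[1], p=policy_probs[state]))
        pairs.append((state, token))
        state = token + 1
    return pairs


def train(teacher, horizon, steps, lr, on_policy, seed=0):
    """Fit a tabular student to a (possibly noisy) teacher table.

    on_policy=False: SFT-style. States and one-hot targets both come from
        trajectories sampled from the teacher.
    on_policy=True: generic OPD-style. States come from the student's own
        trajectories, and the teacher is queried for its token distribution
        at each visited state.
    """
    rng = np.random.default_rng(seed)
    logits = np.zeros_like(teacher)
    num_tokens = teacher.shape[1]
    for _ in range(steps):
        student = softmax(logits)
        if on_policy:
            for state, _ in rollout(student, horizon, rng):
                target = teacher[state]  # teacher feedback on a student state
                # Gradient of cross-entropy(target, softmax(logits[state])).
                logits[state] -= lr * (student[state] - target)
        else:
            for state, token in rollout(teacher, horizon, rng):
                target = np.eye(num_tokens)[token]  # teacher's sampled token
                logits[state] -= lr * (student[state] - target)
    return softmax(logits)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    toy_teacher = rng.dirichlet(np.ones(4), size=5)  # hypothetical noisy teacher
    sft_student = train(toy_teacher, horizon=6, steps=300, lr=0.5, on_policy=False)
    opd_student = train(toy_teacher, horizon=6, steps=300, lr=0.5, on_policy=True)
    print("SFT-style start-state policy:", sft_student[0].round(3))
    print("OPD-style start-state policy:", opd_student[0].round(3))
```

In this sketch the two branches differ only in the sampling policy and the target. The offline branch never sees states that the teacher does not visit, while the online branch receives teacher feedback exactly on the states the student reaches. The toy does not model the clean versus noisy expert gap that the paper analyzes.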
In practice
- Prefer on-policy distillation for LMs with imperfect teachers.
- Consider alternative loss functions for LM training.
- Account for expert noise in imitation learning designs.
Topics
- Imitation Learning
- On-Policy Distillation
- Supervised Fine-Tuning
- Language Models
- Noisy Expert Feedback
- Sample Complexity
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.