Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

2026-06-29 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, quick

Summary

A new paper, "Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback," published on 2026-06-29, investigates why online imitation learning (IL) methods such as on-policy distillation (OPD) frequently outperform offline methods such as supervised fine-tuning (SFT) in language model training. The authors introduce a noisy expert model in which the learner can access an imperfect expert policy but aims to match the reward of a clean expert. They show that offline learning from noisy trajectories is fundamentally difficult: matching the clean expert requires exponential sample complexity. In contrast, online interaction with the noisy expert, using a novel OPD variant, achieves sample complexity with polynomial dependence on the horizon. The paper also identifies a condition on the expert noise distribution under which horizon-free sample complexity is achievable, though with reduced statistical efficiency. Together, these results offer a theoretical basis for OPD's empirical advantage over SFT when training language models with imperfect teachers.

Key takeaway

For machine learning engineers training language models with imperfect or noisy expert data: prefer on-policy distillation (OPD) over supervised fine-tuning (SFT). The paper gives this choice a theoretical justification, showing that online OPD attains better sample complexity than offline learning when the expert is noisy. Consider the OPD variants or alternative loss functions from this analysis when expert feedback is less than ideal.

Key insights

Online on-policy distillation (OPD) is theoretically superior to offline methods for Imitation Learning with noisy expert feedback, especially for language models.

Principles

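Offline learning from noisy trajectories requires exponential sample complexity to match a clean expert.
Online interaction via OPD achieves sample complexity with polynomial dependence on the horizon.
Under a specific condition on the expert noise distribution, horizon-free sample complexity is achievable, at the cost of reduced statistical efficiency.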

Method

The paper proposes a novel variant of on-policy distillation (OPD) for online interaction with noisy experts and proves that, in general, its sample complexity depends polynomially on the horizon.
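
To make the online setting concrete, here is a minimal sketch of a generic on-policy distillation loop on a toy tabular problem. It is not the paper's novel variant, whose details are not given in this summary. Everything named here is a hypothetical assumption for illustration: the state and action counts, the `step` transition, the `clean_action` rule, and the uniform-mixture noise model in `noisy_expert_probs`. The idea shown is only the defining feature of OPD: the learner rolls out its own policy and queries the noisy expert at the states it actually visits, rather than fitting a fixed set of expert trajectories as SFT does.

```python
import math
import random

N_STATES = 4    # hypothetical toy state space
N_ACTIONS = 3   # hypothetical toy action space
HORIZON = 6     # rollout length

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clean_action(state):
    # Hypothetical clean expert: a fixed deterministic rule.
    return state % N_ACTIONS

def noisy_expert_probs(state, noise=0.2):
    # Assumed noise model: clean action mixed with uniform noise.
    probs = [noise / N_ACTIONS] * N_ACTIONS
    probs[clean_action(state)] += 1.0 - noise
    return probs

def step(state, action):
    # Hypothetical deterministic transition.
    return (state + action + 1) % N_STATES

def opd_round(logits, rng, lr=0.5):
    # 1) Roll out the learner's own policy and record the visited states.
    visited, state = [], 0
    for _ in range(HORIZON):
        visited.append(state)
        probs = softmax(logits[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        state = step(state, action)
    # 2) Query the noisy expert at those states and take one gradient step
    #    per visit on the cross-entropy between expert and learner.
    for s in visited:
        learner = softmax(logits[s])
        teacher = noisy_expert_probs(s)
        for a in range(N_ACTIONS):
            # d(cross-entropy)/d(logit_a) = learner[a] - teacher[a]
            logits[s][a] -= lr * (learner[a] - teacher[a])

def train(n_rounds=300, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_rounds):
        opd_round(logits, rng)
    return logits

def greedy_policy(logits):
    return [max(range(N_ACTIONS), key=lambda a: row[a]) for row in logits]

if __name__ == "__main__":
    learned = train()
    print("greedy policy:", greedy_policy(learned))
    print("clean expert: ", [clean_action(s) for s in range(N_STATES)])
```

In this toy, the learner converges toward the noisy expert's distribution, and greedy decoding recovers the clean action only because the assumed uniform noise leaves the clean action as the most likely one. That is a property of the sketch, not a restatement of the paper's noise condition or guarantees.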

In practice

Prefer on-policy distillation for LMs with imperfect teachers.
Consider alternative loss functions for LM training.
Account for expert noise in imitation learning designs.

Topics

Imitation Learning
On-Policy Distillation
Supervised Fine-Tuning
Language Models
Noisy Expert Feedback
Sample Complexity

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.