When Context Returns: Toward Robust Internalization in On-Policy Distillation

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Recent research identifies a failure mode in on-policy distillation: internalizing privileged context improves a student model's no-context performance, but reintroducing that context can paradoxically degrade its accuracy, a phenomenon termed "context-induced degradation." The work argues that robust internalization requires not only matching the teacher's context-conditioned behavior but also remaining stable when the context is reintroduced, a property called "context removability." To address this, it proposes a lightweight consistency regularizer that anchors the student's no-context output via stop-gradient and penalizes deviation of the context-conditioned output from that anchor using forward KL divergence, at the cost of one extra forward pass per training step. Across 12 configurations spanning diverse domains and model families, the method improves context-conditioned accuracy in most settings, reduces context-induced harm in 11 of 12, and eliminates response-length inflation.

Key takeaway

Machine learning engineers deploying on-policy distilled models should be aware that reintroducing the original context can degrade performance, even on instances the model previously answered correctly. Consider the proposed lightweight consistency regularizer, which anchors the no-context output with a stop-gradient and applies a forward KL penalty to the context-conditioned output, at the cost of one extra forward pass per training step. In the reported experiments it reduced context-induced harm in 11 of 12 configurations and improved context-conditioned accuracy in most of them.

Key insights

Reintroducing privileged context to on-policy distilled models can degrade performance, a problem addressed by a new consistency regularizer ensuring context removability.

Method

A lightweight consistency regularizer anchors the student's no-context output via stop-gradient, then penalizes deviation of the context-conditioned output from that anchor using forward KL divergence, at the cost of one extra forward pass per training step.
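
The penalty described above can be sketched for a single token position. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the KL direction is read as KL(anchor || context-conditioned), and the summary does not specify the loss weight or how the term is combined with the distillation objective. Because plain Python has no autodiff, the stop-gradient is made explicit by returning a gradient only for the context-conditioned logits; in a framework it would correspond to something like `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def consistency_penalty(no_ctx_logits, ctx_logits):
    """Forward KL from the no-context anchor to the context-conditioned output.

    Assumption: the anchor is treated as a constant (stop-gradient), so the
    only gradient returned is with respect to ctx_logits.
    Returns (loss, grad_wrt_ctx_logits).
    """
    log_p = log_softmax(no_ctx_logits)  # anchor: no gradient flows here
    log_q = log_softmax(ctx_logits)     # context-conditioned output

    # KL(p || q) = sum_i p_i * (log p_i - log q_i)
    loss = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

    # With p held constant, d KL / d ctx_logits_j = q_j - p_j
    grad_ctx = [math.exp(lq) - math.exp(lp) for lp, lq in zip(log_p, log_q)]
    return loss, grad_ctx


# Hypothetical logits over a 3-token vocabulary
loss, grad = consistency_penalty([2.0, 0.5, -1.0], [0.5, 2.0, -1.0])
same_loss, same_grad = consistency_penalty([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
```

The penalty is zero when the two distributions agree and grows as the context-conditioned output drifts from the anchor. The second set of logits comes from the additional forward pass the method requires at each training step.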

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.