Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
Summary
Canonical-Context On-Policy Distillation (CCOPD) addresses a critical issue where large language models (LLMs) fail to maintain consistent performance when the same evidence is presented incrementally across multi-turn conversations, compared to a single, full prompt. This performance gap, termed "self-anchored drift," arises from unsupported assumptions introduced by partial information. CCOPD mitigates this by training a student LLM incrementally on multi-turn dialogues, aligning its behavior with a frozen teacher LLM that processes the complete, canonical prompt. Trained specifically on math problem conversations, CCOPD achieved a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Analysis indicates CCOPD strengthens grounding in user evidence and reduces sensitivity to prior assistant turns.
Key takeaway
For Machine Learning Engineers developing multi-turn language models, CCOPD offers a robust training paradigm to overcome performance degradation from incremental information. Implement this distillation approach to significantly enhance RAW-SHARDED performance and strengthen grounding in user evidence, reducing sensitivity to prior assistant turns. This method ensures your models maintain consistent accuracy across complex conversational flows, mirroring full-context capabilities.
Key insights
Aligning multi-turn LLM behavior with full-context teacher responses mitigates self-anchored drift.
Principles
- Self-anchored drift distorts multi-turn LLM answers.
- Full-context teacher guidance improves multi-turn student grounding.
Method
CCOPD trains a student LLM incrementally on multi-turn conversations, aligning its responses with a frozen teacher LLM conditioned on the complete, canonical prompt, to reduce unsupported assumptions.
In practice
- Apply CCOPD to improve multi-turn LLM robustness.
- Use a full-context teacher to guide multi-turn student training.
Topics
- Large Language Models
- Multi-Turn Conversations
- On-Policy Distillation
- Self-Anchored Drift
- Zero-Shot Performance
- Conversational AI
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.