Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Canonical-Context On-Policy Distillation (CCOPD) addresses a failure mode in which large language models (LLMs) perform worse when evidence arrives incrementally across a multi-turn conversation than when the same evidence is given in a single, full prompt. This performance gap, termed "self-anchored drift," arises from unsupported assumptions introduced by partial information. CCOPD mitigates it by training a student LLM incrementally on multi-turn dialogues while aligning its behavior with a frozen teacher LLM that processes the complete, canonical prompt. Trained specifically on math problem conversations, CCOPD achieved a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Analysis indicates that CCOPD strengthens grounding in user evidence and reduces sensitivity to prior assistant turns.
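To make the "32% average relative improvement" figure concrete, the sketch below shows how a relative improvement is typically computed. It assumes the average is a mean of per-task relative gains, which this summary does not specify; the function names are hypothetical and any scores passed in are placeholders, not results from the paper.

```python
from typing import Iterable, Tuple


def relative_improvement(base_score: float, new_score: float) -> float:
    """Return (new - base) / base, e.g. 0.32 for a 32% relative gain."""
    if base_score <= 0:
        raise ValueError("base_score must be positive")
    return (new_score - base_score) / base_score


def mean_relative_improvement(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the per-task relative gains over (base, new) score pairs.

    Assumption: a mean of per-task gains. A ratio of mean scores would be
    a different statistic and generally gives a different number.
    """
    gains = [relative_improvement(base, new) for base, new in pairs]
    if not gains:
        raise ValueError("need at least one (base, new) pair")
    return sum(gains) / len(gains)
```

Note that a relative gain depends on the baseline: the same absolute gain counts for more when the base model's sharded score is low.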

Key takeaway

For machine learning engineers developing multi-turn language models, CCOPD offers a training approach for reducing the performance loss that occurs when information arrives incrementally. Distilling from a full-context teacher improved RAW-SHARDED performance, strengthened grounding in user evidence, and reduced sensitivity to prior assistant turns, while largely preserving full-context performance. The goal is a model whose multi-turn answers stay close to what it would produce given the complete prompt.

Key insights

Aligning multi-turn LLM behavior with full-context teacher responses mitigates self-anchored drift.

Principles

Self-anchored drift distorts multi-turn LLM answers.
Full-context teacher guidance improves multi-turn student grounding.

Method

CCOPD trains a student LLM incrementally on multi-turn conversations, aligning its responses with a frozen teacher LLM conditioned on the complete, canonical prompt, to reduce unsupported assumptions.
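The method described above can be sketched as a per-token distillation loss. This is a toy illustration under stated assumptions, not the paper's implementation: the summary does not give the loss, so reverse KL (a common choice in on-policy distillation) is assumed; the `[user]`/`[assistant]` prompt format, the function names, and the toy models are all hypothetical. In a real training loop, `sampled_tokens` would be drawn from the student itself, which is what makes the procedure on-policy.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (context, tokens generated so far) -> next-token logits.
LogitsFn = Callable[[str, Sequence[int]], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student: Sequence[float], teacher: Sequence[float],
               eps: float = 1e-12) -> float:
    """KL(student || teacher) at one position (assumed loss, see lead-in)."""
    return sum(s * (math.log(max(s, eps)) - math.log(max(t, eps)))
               for s, t in zip(student, teacher))


def build_contexts(shards: Sequence[str],
                   assistant_replies: Sequence[str]) -> Tuple[str, str]:
    """Return (multi-turn context, canonical full-context prompt)."""
    turns = []
    for i, shard in enumerate(shards):
        turns.append(f"[user] {shard}")
        if i < len(assistant_replies):
            turns.append(f"[assistant] {assistant_replies[i]}")
    canonical = "[user] " + " ".join(shards)  # all evidence in one prompt
    return "\n".join(turns), canonical


def distillation_loss(student_fn: LogitsFn, teacher_fn: LogitsFn,
                      multi_turn: str, canonical: str,
                      sampled_tokens: Sequence[int]) -> float:
    """Mean per-token divergence along a response sampled from the student.

    The student is scored on the incremental conversation; the frozen
    teacher is scored on the canonical prompt for the same tokens.
    """
    if not sampled_tokens:
        return 0.0
    total = 0.0
    for i in range(len(sampled_tokens)):
        prefix = sampled_tokens[:i]
        student_probs = softmax(student_fn(multi_turn, prefix))
        teacher_probs = softmax(teacher_fn(canonical, prefix))
        total += reverse_kl(student_probs, teacher_probs)
    return total / len(sampled_tokens)


def toy_teacher(context: str, prefix: Sequence[int]) -> List[float]:
    """Toy frozen teacher: fixed logits regardless of context."""
    return [2.0, 0.0, -1.0]


def toy_student(context: str, prefix: Sequence[int]) -> List[float]:
    """Toy student whose logits shift with each prior assistant turn."""
    drift = context.count("[assistant]")
    return [2.0 - drift, 0.0, -1.0]


shards = ["x = 3.", "y = 4.", "What is x + y?"]
multi_turn, canonical = build_contexts(shards, ["Assuming y = 0 for now.", "Noted."])
loss = distillation_loss(toy_student, toy_teacher, multi_turn, canonical, [0, 1])
```

In this toy setup the loss is zero when the conversation contains no assistant turns and positive once earlier assistant turns shift the student's distribution, which mirrors the sensitivity to prior assistant turns that the method aims to reduce. Only the student would receive gradient updates; the teacher stays frozen.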

In practice

Apply CCOPD to improve multi-turn LLM robustness.
Use a full-context teacher to guide multi-turn student training.

Topics

Large Language Models
Multi-Turn Conversations
On-Policy Distillation
Self-Anchored Drift
Zero-Shot Performance
Conversational AI

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.