Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

Canonical-Context On-Policy Distillation (CCOPD) addresses a critical issue where large language models (LLMs) fail to maintain consistent performance when the same evidence is presented incrementally across multi-turn conversations, compared to a single, full prompt. This performance gap, termed "self-anchored drift," arises from unsupported assumptions introduced by partial information. CCOPD mitigates this by training a student LLM incrementally on multi-turn dialogues, aligning its behavior with a frozen teacher LLM that processes the complete, canonical prompt. Trained specifically on math problem conversations, CCOPD achieved a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Analysis indicates CCOPD strengthens grounding in user evidence and reduces sensitivity to prior assistant turns.

Key takeaway

For Machine Learning Engineers developing multi-turn language models, CCOPD offers a robust training paradigm to overcome performance degradation from incremental information. Implement this distillation approach to significantly enhance RAW-SHARDED performance and strengthen grounding in user evidence, reducing sensitivity to prior assistant turns. This method ensures your models maintain consistent accuracy across complex conversational flows, mirroring full-context capabilities.

Key insights

Aligning multi-turn LLM behavior with full-context teacher responses mitigates self-anchored drift.

Principles

Method

CCOPD trains a student LLM incrementally on multi-turn conversations, aligning its responses with a frozen teacher LLM conditioned on the complete, canonical prompt, to reduce unsupported assumptions.

In practice

Topics

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.