Distribution Corrected Offline Data Distillation for Large Language Models
Summary
The paper "Distribution Corrected Offline Data Distillation for Large Language Models" proposes an offline reasoning distillation framework that addresses the trade-off between supervision quality and distributional drift when training smaller language models from larger ones. Existing offline methods use high-quality, teacher-generated traces but suffer from compounding errors at inference time: the student conditions on teacher prefixes during training but on its own self-generated prefixes during inference. On-policy methods, by contrast, match the inference distribution but are costly and produce low-quality traces early in training. The proposed framework keeps the efficiency and supervision quality of offline teacher data while adaptively emphasizing teacher supervision that aligns better with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks including GSM8K, MATH, MATH500, AMC, AIME, and OlympiadBench show improved reasoning accuracy and more stable traces compared to prior offline distillation algorithms, without requiring online rollouts.
Key takeaway
For AI Engineers and Research Scientists developing smaller, more efficient language models: consider integrating distribution-correction-aware training into your offline distillation pipelines. This approach lets you keep high-quality teacher supervision while mitigating distributional drift, which the paper reports leads to more accurate and stable reasoning in resource-constrained settings without the expense of online rollouts. The reported benchmark results suggest gains on complex reasoning tasks.
Key insights
Offline distillation can be improved by correcting teacher-student distribution drift without costly online sampling.
Principles
- Offline distillation offers sample-efficient, high-quality supervision.
- Distributional drift causes compounding errors in autoregressive models.
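To make the compounding-error principle concrete, here is a deliberately simplified back-of-the-envelope sketch. It assumes each generated token goes wrong independently with a fixed probability, which is an illustrative assumption and not a model taken from the paper; the function name `p_error_free` and the 1% error rate are hypothetical.

```python
def p_error_free(per_token_error: float, length: int) -> float:
    """Probability that a trace of `length` tokens contains no error,
    assuming independent errors with a constant per-token rate
    (a simplifying assumption for illustration only)."""
    return (1.0 - per_token_error) ** length


# Hypothetical 1% per-token error rate across increasing trace lengths.
lengths = (10, 100, 500)
probabilities = [p_error_free(0.01, n) for n in lengths]

for n, p in zip(lengths, probabilities):
    print(f"length={n:4d}  P(no error)={p:.3f}")
```

Even under this optimistic independence assumption, the chance of a fully correct trace shrinks geometrically with length. In practice drift can be worse, since an early mistake moves the student onto prefixes it never saw during training.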
Method
The proposed method adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution, correcting teacher-student distribution drift in offline reasoning distillation.
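The summary does not give the paper's actual weighting rule, so the following is only a minimal sketch of one way "emphasizing teacher supervision that is better aligned with the student" could look. It assumes alignment is measured by the student's own length-normalized log-likelihood of each teacher trace and that weights come from a softmax over those scores. The names `sequence_weights`, `weighted_nll`, and `temperature` are hypothetical and not taken from the paper.

```python
import math


def sequence_weights(student_logprobs, temperature=1.0):
    """Softmax over length-normalized student log-likelihoods.

    `student_logprobs` holds one list per teacher trace, containing the
    student's log-probability of each teacher token. Traces the student
    already finds plausible receive larger weights (assumed scheme).
    """
    scores = [sum(lp) / len(lp) / temperature for lp in student_logprobs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_nll(student_logprobs, weights):
    """Weighted average of per-trace, length-normalized negative log-likelihood.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them).
    """
    per_trace = [-sum(lp) / len(lp) for lp in student_logprobs]
    return sum(w * nll for w, nll in zip(weights, per_trace))


# Two hypothetical teacher traces scored by the student.
batch = [
    [-0.2, -0.1, -0.3],  # close to what the student would generate
    [-2.0, -1.5, -2.5],  # far from the student's own distribution
]
weights = sequence_weights(batch)
loss = weighted_nll(batch, weights)
print(weights, loss)
```

Everything here runs on stored teacher traces and the student's forward pass, so no online rollouts are needed. The `temperature` parameter controls how sharply the weighting favors aligned traces; large values approach uniform weighting, which corresponds to plain offline distillation.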
In practice
- Improve reasoning accuracy in smaller LLMs.
- Enhance stability of reasoning traces.
- Preserve instruction-following capabilities.
Topics
- Distribution Corrected Distillation
- Large Language Models
- Reasoning Traces
- Offline Data Distillation
- Distributional Drift
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.