Distribution Corrected Offline Data Distillation for Large Language Models

2026-05-15 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

"Distribution Corrected Offline Data Distillation for Large Language Models" proposes an offline reasoning distillation framework that targets the trade-off between supervision quality and distributional drift when training smaller language models from larger ones. Existing offline methods train on high-quality, teacher-generated traces, but the student conditions on teacher prefixes during training and on its own self-generated prefixes at inference, so errors compound at inference time. On-policy methods match the inference distribution but are costly and produce low-quality traces early in training. The proposed framework keeps the efficiency and supervision quality of offline teacher data while adaptively emphasizing the teacher supervision that aligns better with the student's on-policy distribution. On mathematical reasoning benchmarks including GSM8K, MATH, MATH500, AMC, AIME, and OlympiadBench, the authors report improved reasoning accuracy and more stable traces than prior offline distillation algorithms, without requiring online rollouts.

Key takeaway

AI engineers and research scientists building smaller, more efficient language models should consider adding distribution-correction-aware training to their offline distillation pipelines. The approach keeps high-quality teacher supervision while mitigating distributional drift, and the reported results suggest more accurate and stable reasoning in resource-constrained settings without the expense of online rollouts.

Key insights

Offline distillation can be improved by correcting teacher-student distribution drift without costly online sampling.

Principles

Offline distillation offers sample-efficient, high-quality supervision.
Distributional drift causes compounding errors in autoregressive models.
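
To make the compounding-error principle concrete, the sketch below uses a toy model that is not taken from the paper: it assumes each generated step goes wrong independently with a fixed probability. The function name and parameters are hypothetical. Real drift is usually worse than this, because an early mistake shifts the prefix distribution and raises the error rate of later steps.

```python
def error_free_probability(per_step_error: float, trace_length: int) -> float:
    """Probability that a trace of `trace_length` steps contains no error.

    Toy assumption (not from the paper): steps fail independently with the
    same probability `per_step_error`.
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be in [0, 1]")
    if trace_length < 0:
        raise ValueError("trace_length must be non-negative")
    return (1.0 - per_step_error) ** trace_length


if __name__ == "__main__":
    # Even a 1% per-step error rate leaves long traces unlikely to be clean.
    for length in (10, 100, 500):
        print(length, round(error_free_probability(0.01, length), 3))
```

Under this simplification, a 1% per-step error rate leaves only about a 37% chance that a 100-step trace is error-free, which is why long reasoning traces are especially sensitive to drift.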

Method

The proposed method adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution, correcting teacher-student distribution drift in offline reasoning distillation.
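
The summary does not specify how the emphasis is computed, so the following is only a minimal sketch of one plausible reading: weight each teacher trace by how likely the student itself finds it, then take a weighted negative log-likelihood. The softmax weighting, the `temperature` parameter, and all function names are assumptions made for illustration, not the authors' algorithm.

```python
import math
from typing import List, Sequence


def alignment_weights(student_mean_log_probs: Sequence[float],
                      temperature: float = 1.0) -> List[float]:
    """Softmax over per-trace student log-likelihoods, rescaled to average 1.

    Hypothetical weighting: traces the student finds more likely (closer to
    its own on-policy distribution) receive larger weights.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not student_mean_log_probs:
        raise ValueError("need at least one trace")
    scaled = [lp / temperature for lp in student_mean_log_probs]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    count = len(exps)
    return [count * e / total for e in exps]


def weighted_distillation_loss(student_token_log_probs: Sequence[Sequence[float]],
                               temperature: float = 1.0) -> float:
    """Weighted mean negative log-likelihood over teacher traces.

    Each inner sequence holds the student's log-probabilities of the teacher's
    tokens for one trace. In a real training loop the weights would be treated
    as constants (no gradient flows through them).
    """
    mean_lps = [sum(tokens) / len(tokens) for tokens in student_token_log_probs]
    weights = alignment_weights(mean_lps, temperature)
    losses = [-lp for lp in mean_lps]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


if __name__ == "__main__":
    # Two teacher traces: the student finds the first far more likely.
    batch = [[-0.1, -0.2], [-2.0, -3.0]]
    print(alignment_weights([-0.15, -2.5]))
    print(weighted_distillation_loss(batch))
```

A higher `temperature` flattens the weights toward plain offline distillation, while a lower one concentrates training on the traces closest to the student's own distribution. Either way, no online rollouts are needed, because the weights come from scoring existing teacher data.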

In practice

Improve reasoning accuracy in smaller LLMs.
Enhance stability of reasoning traces.
Preserve instruction-following capabilities.

Topics

Distribution Corrected Distillation
Large Language Models
Reasoning Traces
Offline Data Distillation
Distributional Drift

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.