Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

On-Policy Distillation (OPD), a key post-training technique for transferring knowledge from an expert Large Language Model (LLM) to a student model, has so far required the teacher and student to share a tokenizer. That requirement restricts OPD to teacher-student pairs within a single model series. Cross-tokenizer distillation today typically falls back on Supervised Fine-Tuning (SFT) on teacher-generated responses, which trains only on the sampled text and does not use the rest of the teacher's probability distribution. A new method lets standard on-policy distillation work across model families by using a precise token-mapping algorithm, so that high-fidelity token-level signals can pass between different tokenizers. In experiments, this cross-tokenizer OPD is reported to be significantly more compute-efficient than existing baselines across various benchmarks, and it widens the range of compatible teacher-student LLM pairs.
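
To make the contrast between SFT and OPD concrete, the two objectives can be written side by side. This is one common formulation, not necessarily the one used in the source: the symbols $\pi_\theta$ (student), $\pi_T$ (teacher), $x$ (prompt), $y$ (response), and the choice of reverse KL are assumptions made for illustration.

```latex
% SFT on teacher samples: responses come from the teacher, and only the sampled tokens are scored.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{x,\; y \sim \pi_T(\cdot \mid x)}
    \sum_{t} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)

% OPD (reverse-KL form): responses come from the student, and the teacher scores every position.
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \sum_{t} D_{\mathrm{KL}}\!\left(
      \pi_\theta\!\left(\cdot \mid x, y_{<t}\right)
      \,\middle\|\,
      \pi_T\!\left(\cdot \mid x, y_{<t}\right)
    \right)
```

In this form the KL term compares two distributions over the same vocabulary at the same position $t$, which is why a shared tokenizer is normally required and why a token mapping is needed once the vocabularies differ.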

Key takeaway

If you are a Machine Learning Engineer designing an LLM distillation pipeline and your teacher and student models use different tokenizers, consider cross-tokenizer On-Policy Distillation (OPD). Enabled by a precise token-mapping algorithm, it lets the student learn from the teacher's full probability distribution, which SFT on teacher-generated responses does not. It is reported to be significantly more compute-efficient than existing baselines, and it widens your choice of teacher-student pairs across model families.

Key insights

Cross-tokenizer On-Policy Distillation (OPD) enables knowledge transfer between LLMs with different tokenizers via a precise token-mapping algorithm.

Principles

Tokenizer compatibility limits LLM knowledge transfer.
Precise token-mapping enables cross-tokenizer distillation.
Cross-tokenizer OPD is compute-efficient.

Method

Enable standard on-policy distillation across model families using a precise token-mapping algorithm to propagate high-fidelity token-level signals between different tokenizers.
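
The summary does not describe the token-mapping algorithm itself, so the following is only a minimal sketch of one generic way to line up two tokenizations of the same text: group tokens into the smallest chunks whose character boundaries coincide, then compare summed log-probabilities per chunk. The function names `align_token_spans` and `chunk_logprobs`, the use of decoded surface strings as tokens, and the per-chunk log-probability gap are all assumptions for illustration, not the source's method.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end) token indices


def align_token_spans(
    student_tokens: List[str], teacher_tokens: List[str]
) -> List[Tuple[Span, Span]]:
    """Group two tokenizations of the same text into minimal aligned chunks.

    Assumes tokens are non-empty decoded surface strings (no byte-level
    markers), so that both sequences concatenate to the same text.
    """
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("tokenizations must spell the same text")

    chunks: List[Tuple[Span, Span]] = []
    i = j = 0              # next unread token on each side
    s_chars = t_chars = 0  # characters consumed so far on each side
    s_start = t_start = 0  # where the current chunk began
    while i < len(student_tokens) or j < len(teacher_tokens):
        # Advance whichever side has consumed fewer characters.
        if s_chars <= t_chars and i < len(student_tokens):
            s_chars += len(student_tokens[i])
            i += 1
        else:
            t_chars += len(teacher_tokens[j])
            j += 1
        # Equal character counts mark a boundary shared by both tokenizers.
        if s_chars == t_chars:
            chunks.append(((s_start, i), (t_start, j)))
            s_start, t_start = i, j
    return chunks


def chunk_logprobs(token_logprobs: List[float], spans: List[Span]) -> List[float]:
    """Sum per-token log-probs inside each span (chain rule over the chunk)."""
    return [sum(token_logprobs[a:b]) for a, b in spans]


# Hypothetical example: the same string split differently by two tokenizers.
student = ["un", "believ", "able", "!"]
teacher = ["unbe", "liev", "able!"]
chunks = align_token_spans(student, teacher)

student_lp = [-0.5, -1.0, -0.25, -0.25]  # made-up per-token log-probs
teacher_lp = [-1.0, -0.5, -0.75]
s_chunk = chunk_logprobs(student_lp, [s for s, _ in chunks])
t_chunk = chunk_logprobs(teacher_lp, [t for _, t in chunks])

# Per-chunk gap log p_student - log p_teacher on student-sampled text.
gaps = [s - t for s, t in zip(s_chunk, t_chunk)]
```

Here the two tokenizers first agree on a boundary after "unbeliev", so the text splits into two chunks, and each chunk yields one comparable student/teacher log-probability pair. A production mapping would also need to handle byte-level and whitespace markers, special tokens, and how a chunk-level signal is credited back to individual student tokens, none of which the summary specifies.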

In practice

Pair teacher and student LLMs from different model families.
Pass token-level distillation signals between LLMs that use different tokenizers.
Improve compute efficiency in distillation.

Topics

On-Policy Distillation
LLM Distillation
Tokenizers
Model Interoperability
Compute Efficiency
Knowledge Transfer

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.