Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families
Summary
On-Policy Distillation (OPD) is a key post-training technique for transferring knowledge from expert Large Language Models (LLMs) to student models, but it has required the teacher and student to share the same tokenizer. This restriction confines OPD to models within the same family. Current cross-tokenizer distillation typically relies on Supervised Fine-Tuning (SFT) on teacher-generated responses, which discards the information in the teacher's full probability distribution. A new method enables standard on-policy distillation across different model families by using a precise token-mapping algorithm, so that high-fidelity token-level signals can propagate between disparate tokenizers. Experiments show that this cross-tokenizer OPD is significantly more compute-efficient than existing baselines across various benchmarks, and it expands the range of compatible teacher-student LLM pairs.
Key takeaway
For Machine Learning Engineers designing LLM distillation pipelines: if your teacher and student models use different tokenizers, consider cross-tokenizer On-Policy Distillation (OPD). Enabled by a precise token-mapping algorithm, it lets the student learn from the teacher's full probability distribution, whereas traditional SFT learns only from the teacher's sampled responses. It is reported to be significantly more compute-efficient than existing baselines, and it widens your choice of teacher-student pairs across model families.
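To make the contrast with SFT concrete, here is a minimal sketch for a single token position where both models share one vocabulary. It assumes reverse KL as the distillation signal, which is one common choice for on-policy distillation; the summary does not say which divergence the original article uses, and the toy distributions and function names are invented for illustration.

```python
import math

def sft_token_loss(student_probs, target_token):
    """Cross-entropy on one teacher-sampled token.

    Only the probability the student assigns to that single token matters.
    """
    return -math.log(student_probs[target_token])

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) over a shared vocabulary (assumed objective).

    Every token the student puts mass on contributes, so the teacher's
    whole distribution shapes the signal. Assumes teacher_probs[tok] > 0.
    """
    return sum(
        p * math.log(p / teacher_probs[tok])
        for tok, p in student_probs.items()
        if p > 0.0
    )

# Toy next-token distributions (hypothetical values).
teacher = {"cat": 0.6, "dog": 0.3, "fish": 0.1}
student = {"cat": 0.4, "dog": 0.2, "fish": 0.4}

sft = sft_token_loss(student, "cat")   # sees only the sampled token "cat"
rkl = reverse_kl(student, teacher)     # also penalizes the excess mass on "fish"
```

With a shared vocabulary the two dictionaries have the same keys, which is exactly what breaks when the tokenizers differ.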
Key insights
Cross-tokenizer On-Policy Distillation (OPD) enables knowledge transfer between LLMs with different tokenizers via a precise token-mapping algorithm.
Principles
- Tokenizer compatibility limits LLM knowledge transfer.
- Precise token-mapping enables cross-tokenizer distillation.
- Cross-tokenizer OPD is compute-efficient.
Method
Enable standard on-policy distillation across model families using a precise token-mapping algorithm to propagate high-fidelity token-level signals between different tokenizers.
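The summary does not describe the token-mapping algorithm itself, so the following is only an illustrative sketch of one generic way to line up two tokenizations of the same text: group tokens into the smallest spans whose character boundaries coincide, then compare log-probabilities per span. It is not the authors' method, and `align_spans`, `span_logprobs`, and all sample values are hypothetical.

```python
def align_spans(student_tokens, teacher_tokens):
    """Return [((s_start, s_end), (t_start, t_end)), ...] index ranges.

    Each pair covers the same characters in both tokenizations.
    Assumes non-empty token strings that spell out identical text.
    """
    if "".join(student_tokens) != "".join(teacher_tokens):
        raise ValueError("tokenizations must spell the same text")
    spans = []
    i = j = 0              # next token index on each side
    s_start = t_start = 0  # where the current span began
    s_pos = t_pos = 0      # characters consumed on each side
    while i < len(student_tokens) or j < len(teacher_tokens):
        # Advance whichever side is behind in character position.
        if s_pos <= t_pos and i < len(student_tokens):
            s_pos += len(student_tokens[i])
            i += 1
        else:
            t_pos += len(teacher_tokens[j])
            j += 1
        # A shared character boundary closes the current span.
        if s_pos == t_pos and i > s_start and j > t_start:
            spans.append(((s_start, i), (t_start, j)))
            s_start, t_start = i, j
    return spans

def span_logprobs(token_logprobs, spans, side):
    """Sum per-token log-probs inside each span (side 0 = student, 1 = teacher)."""
    return [sum(token_logprobs[a:b]) for a, b in (pair[side] for pair in spans)]

# Two hypothetical tokenizations of "unbelievable!".
student_tokens = ["un", "believ", "able", "!"]
teacher_tokens = ["unbe", "liev", "able", "!"]
spans = align_spans(student_tokens, teacher_tokens)

# Hypothetical per-token log-probs of the sampled text under each model.
student_lp = span_logprobs([-1.0, -0.5, -0.25, -0.5], spans, side=0)
teacher_lp = span_logprobs([-0.5, -0.25, -0.125, -1.0], spans, side=1)
```

Span-level log-probabilities of the sampled text are coarser than a full per-token distribution, so a sketch like this only shows why some mapping step is needed before teacher signals can reach a student with a different vocabulary.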
In practice
- Distill a teacher from one model family into a student from another.
- Choose teacher-student pairs without requiring a shared tokenizer.
- Improve compute efficiency in distillation.
Topics
- On-Policy Distillation
- LLM Distillation
- Tokenizers
- Model Interoperability
- Compute Efficiency
- Knowledge Transfer
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.