SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
Summary
SimCT (Simple Cross-Tokenizer On-Policy Distillation) is a method for improving on-policy distillation (OPD) when the teacher and student models use different tokenizers. Standard OPD implicitly assumes that teacher and student can be compared token by token. That assumption fails with heterogeneous tokenizers, because only tokens present in both vocabularies can be matched and the rest of the teacher signal is discarded. SimCT addresses this by expanding the supervision space beyond shared tokens to include short multi-token continuations that both tokenizers can realize, without altering the form of the OPD loss. This recovers supervision that exact shared-token matching throws away. In experiments on three teacher-student pairs across mathematical reasoning and code-generation benchmarks, SimCT consistently outperforms shared-vocabulary OPD and other cross-tokenizer baselines.
Key takeaway
For AI engineers distilling a larger teacher into a smaller student that uses a different tokenizer, SimCT recovers teacher signal that exact shared-token matching discards. It does so by comparing the two models over short multi-token continuations while leaving the OPD loss form unchanged. In the reported experiments, this gave better student performance on mathematical reasoning and code generation than shared-vocabulary OPD and other cross-tokenizer baselines.
Key insights
SimCT recovers lost supervision in cross-tokenizer on-policy distillation by comparing multi-token continuations.
Principles
- Heterogeneous tokenizers degrade on-policy distillation.
- Finest jointly tokenizable units are optimal for supervision.
- Coarser alternatives remove useful teacher-student distinctions.
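One way to picture "finest jointly tokenizable units" is as the chunks of a text that lie between boundaries on which both tokenizations agree. The sketch below is a minimal illustration of that idea under stated assumptions: the function name `joint_units` and the toy token lists are hypothetical, and this is not presented as the paper's actual construction.

```python
from itertools import accumulate


def joint_units(teacher_tokens, student_tokens):
    """Split a text into the finest chunks whose boundaries both tokenizations share.

    Hypothetical helper for illustration only; not taken from the SimCT paper.
    """
    text = "".join(teacher_tokens)
    if text != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")
    # Character offsets at which each tokenization ends a token.
    teacher_ends = set(accumulate(len(t) for t in teacher_tokens))
    student_ends = set(accumulate(len(t) for t in student_tokens))
    units, start = [], 0
    # Cut only where both tokenizers place a boundary.
    for end in sorted(teacher_ends & student_ends):
        units.append(text[start:end])
        start = end
    return units


# Toy tokenizations of the same string (assumed, not from real tokenizers).
teacher = ["un", "believ", "able", " result"]
student = ["unbe", "liev", "able", " res", "ult"]
print(joint_units(teacher, student))  # ['unbeliev', 'able', ' result']
```

Merging these units further (for example, treating the whole string as one unit) would still be jointly tokenizable, but it would hide the places where teacher and student could be compared separately, which is the concern the third principle raises.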
Method
SimCT enlarges the supervision space in on-policy distillation by comparing teacher and student over short multi-token continuations that both tokenizers can realize, while keeping the OPD loss form unchanged.
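To make the enlarged supervision space concrete, the following toy sketch enumerates the strings each tokenizer can spell with at most `max_tokens` tokens and intersects the two sets. With `max_tokens=1` this reduces to exact shared-token matching. The vocabularies, the function names `realizable` and `shared_support`, and the brute-force enumeration are all assumptions made for illustration; the sketch does not implement the OPD loss or the paper's actual procedure.

```python
from itertools import product


def realizable(vocab, max_tokens):
    """All strings spelled by concatenating 1..max_tokens tokens from vocab."""
    strings = set()
    for n in range(1, max_tokens + 1):
        for combo in product(vocab, repeat=n):
            strings.add("".join(combo))
    return strings


def shared_support(teacher_vocab, student_vocab, max_tokens):
    """Strings that both tokenizers can realize within the token budget."""
    return realizable(teacher_vocab, max_tokens) & realizable(student_vocab, max_tokens)


# Toy vocabularies (assumed); real vocabularies are far too large to enumerate this way.
teacher_vocab = {"the", " cat", "s"}
student_vocab = {"the", " ca", "ts", "t", "s"}

exact = shared_support(teacher_vocab, student_vocab, max_tokens=1)
enlarged = shared_support(teacher_vocab, student_vocab, max_tokens=2)

print(sorted(exact))        # ['s', 'the']
print(" cat" in exact)      # False: the student has no single token " cat"
print(" cat" in enlarged)   # True: the student spells it as " ca" + "t"
```

In this toy setting, the teacher token " cat" contributes no supervision under exact matching but becomes comparable once two-token continuations are allowed.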
In practice
- Apply SimCT for cross-tokenizer model distillation.
- Use multi-token continuations to restore teacher signal.
- Evaluate on math reasoning and code generation tasks.
Topics
- On-Policy Distillation
- Cross-Tokenizer Distillation
- Heterogeneous Tokenizers
- Multi-token Continuations
- Mathematical Reasoning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.