SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

2026-05-08 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

SimCT (Simple Cross-Tokenizer On-Policy Distillation) is a method for improving on-policy distillation (OPD) when the teacher and student models use different tokenizers. Standard OPD implicitly assumes that teacher and student can be compared token by token. That assumption fails with heterogeneous tokenizers, so much of the teacher signal is lost. SimCT addresses this by expanding the supervision space beyond shared tokens to include short multi-token continuations that both tokenizers can realize, without altering the form of the OPD loss. This recovers supervision that exact shared-token matching discards. In experiments across three teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT consistently outperforms shared-vocabulary OPD and other cross-tokenizer baselines.

Key takeaway

For AI engineers distilling a smaller student from a larger teacher that uses a different tokenizer, SimCT recovers teacher signal that exact shared-token matching discards. It does so by comparing the two models over short multi-token continuations that both tokenizers can realize, while leaving the OPD loss form unchanged. In the reported experiments, this gives better student performance on mathematical reasoning and code generation than shared-vocabulary OPD and other cross-tokenizer baselines.

Key insights

SimCT recovers lost supervision in cross-tokenizer on-policy distillation by comparing multi-token continuations.

Principles

Heterogeneous tokenizers degrade on-policy distillation.
Finest jointly tokenizable units are optimal for supervision.
Coarser alternatives remove useful teacher-student distinctions.
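
One way to picture a "finest jointly tokenizable unit" is as the shortest stretch of text that ends on a token boundary in both tokenizations. The sketch below is an illustration under that reading only, not the paper's algorithm; the function name joint_units and the example token splits are hypothetical.

```python
from itertools import accumulate


def joint_units(teacher_tokens, student_tokens):
    """Split text into the finest spans that end on a token boundary in BOTH tokenizations.

    Illustrative assumption: tokens are plain strings that concatenate to the same text.
    """
    text = "".join(teacher_tokens)
    if text != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")

    # Character offsets at which each tokenizer ends a token.
    teacher_ends = set(accumulate(len(t) for t in teacher_tokens))
    student_ends = set(accumulate(len(t) for t in student_tokens))

    units, start = [], 0
    for end in sorted(teacher_ends & student_ends):
        if end > start:  # skip degenerate empty spans
            units.append(text[start:end])
            start = end
    return units


teacher = ["un", "believ", "able", "!"]
student = ["unbe", "liev", "able", "!"]
print(joint_units(teacher, student))  # ['unbeliev', 'able', '!']
```

In this toy case only "able" and "!" are tokens in both splits, so exact shared-token matching would have nothing to compare over "unbeliev", while the joint unit still covers it. Merging all three units into one span would be a coarser alternative that hides where the two models differ.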

Method

SimCT enlarges the supervision space in on-policy distillation by comparing teacher and student over short multi-token continuations that both tokenizers can realize, while keeping the OPD loss form unchanged.
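
The idea of comparing models over multi-token continuations can be sketched with the chain rule: the log-probability of a span is the sum of the per-token log-probabilities inside it, so both models can be scored on the same text even when they split it differently. The code below is a minimal sketch under assumptions, not the released implementation. The helper names are hypothetical, the log-probabilities are made up, and the per-unit log-ratio (student minus teacher) is only one plausible quantity a reverse-KL-style OPD objective might consume; this summary does not specify the exact loss.

```python
from itertools import accumulate


def shared_boundaries(teacher_tokens, student_tokens):
    """Character offsets where both tokenizations end a token (hypothetical helper)."""
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")
    teacher_ends = set(accumulate(len(t) for t in teacher_tokens))
    student_ends = set(accumulate(len(t) for t in student_tokens))
    return sorted(e for e in teacher_ends & student_ends if e > 0)


def span_logprobs(tokens, logprobs, boundaries):
    """Sum per-token log-probs within each span that ends at a shared boundary."""
    if len(tokens) != len(logprobs):
        raise ValueError("need exactly one log-prob per token")
    sums, running, offset = [], 0.0, 0
    remaining = iter(boundaries)
    target = next(remaining, None)
    for tok, lp in zip(tokens, logprobs):
        offset += len(tok)
        running += lp  # chain rule: log p(span) = sum of token log-probs
        if offset == target:
            sums.append(running)
            running = 0.0
            target = next(remaining, None)
    return sums


def unit_log_ratios(t_tokens, t_logprobs, s_tokens, s_logprobs):
    """Per-unit log p_student - log p_teacher over jointly tokenizable spans."""
    bounds = shared_boundaries(t_tokens, s_tokens)
    t_sums = span_logprobs(t_tokens, t_logprobs, bounds)
    s_sums = span_logprobs(s_tokens, s_logprobs, bounds)
    return [s - t for s, t in zip(s_sums, t_sums)]


# Made-up log-probs for a student-sampled continuation, scored by both models.
ratios = unit_log_ratios(
    ["un", "believ", "able", "!"], [-0.5, -0.25, -1.0, -0.125],
    ["unbe", "liev", "able", "!"], [-1.0, -0.5, -0.5, -0.25],
)
print(ratios)  # [-0.75, 0.5, -0.125]
```

Here the first unit spans two tokens on each side, so its score exists only once the per-token terms are summed; with exact shared-token matching that position would contribute no signal.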

In practice

Apply SimCT for cross-tokenizer model distillation.
Use multi-token continuations to restore teacher signal.
Evaluate on math reasoning and code generation tasks.

Topics

On-Policy Distillation
Cross-Tokenizer Distillation
Heterogeneous Tokenizers
Multi-token Continuations
Mathematical Reasoning

Code references

sunjie279/SimCT-

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.