There are at least ten distinct technical families of teacher→student transfer, not one monolithic “distillation.”
Summary
Teacher-student transfer in AI spans at least ten distinct technical families rather than a single monolithic "distillation," all aimed at letting cheaper "student" models learn from expensive frontier "teachers." The methods range from simple RAG and prompting to more complex logit-level distillation and speculative decoding, and they are driven primarily by economics: inference costs for GPT-3.5-level systems fell more than 280-fold, from ~$20 to ~$0.07 per million tokens, between November 2022 and October 2024. The resulting capability transfer is effective on narrow tasks but partial: it copies style reliably and deep reasoning only unevenly. Rapid adoption of these techniques has compressed the capability gap between the leading closed-weight and the best open-weight models on the Chatbot Arena Leaderboard from 8.04% in January 2024 to 1.70% in February 2025. Significant risks remain, including the transfer of hallucinations and bias, "subliminal learning," "model collapse" from recursive training on synthetic data, and the contested legality of API-based distillation.
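Of the families named above, logit-level distillation is the easiest to show in a few lines. The sketch below uses the common soft-target formulation (temperature-softened softmax plus a KL divergence scaled by the squared temperature); the summary does not say which variant the original article describes, so treat this as an assumption. The function names, the example logits, and the temperature of 2.0 are all illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of raw logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the standard soft-target loss, not necessarily
    the exact objective discussed in the original article.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_same, loss_close, loss_far)
```

The loss is zero when the student matches the teacher's distribution and grows as the two diverge. Note that this requires access to the teacher's logits, which API-only access to a closed model typically does not provide in full.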
Key takeaway
AI scientists and ML engineers evaluating model deployment strategies should try cost-effective teacher-student transfer methods such as RAG and prompt engineering before turning to complex fine-tuning. Validate every distilled model with human evaluation and red-team testing; benchmark gains alone are insufficient given risks like "subliminal learning" and "model collapse." Review API terms of service carefully for training clauses to avoid legal disputes, especially if the open-vs-frontier performance gap on your task is minimal.
Key insights
Teacher-student transfer covers diverse methods for giving smaller models frontier capabilities; it is driven by economics but carries significant risks.
Principles
- Capability transfer is partial, copying style more than deep reasoning.
- Distillation risks include transferring bias, hallucinations, and "subliminal learning."
- Recursive training on synthetic data can lead to "model collapse."
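The last principle can be illustrated with a deliberately tiny toy, assuming nothing about how the original article demonstrates it. Each "generation" fits a Gaussian to a small sample drawn from the previous generation's fit, so the model only ever sees its own synthetic output. The function name, sample size, generation count, and seed are hypothetical choices, and a one-dimensional Gaussian is only a loose analogy for a language model.

```python
import random
import statistics

def simulate_recursive_training(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from its own previous fit.

    Returns the fitted standard deviation after each generation,
    starting from the "real data" distribution N(0, 1).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(generations):
        # Synthetic data produced by the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # The next model is fit only to that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history

history = simulate_recursive_training()
print("initial std:", history[0])
print("final std:", history[-1])
```

In this toy the fitted spread shrinks toward zero over generations because each refit loses a little of the tails and never gets them back. That loss of diversity is the intuition behind "model collapse," although real training pipelines that mix in fresh human data may behave differently.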
In practice
- RAG and prompt engineering offer cheap, reversible capability transfer.
- QLoRA allows fine-tuning 65B-parameter models on a single 48 GB GPU.
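A back-of-envelope calculation shows why the QLoRA figure is plausible. The sketch assumes the base weights are stored at 4 bits per parameter, as QLoRA does, and counts weight storage only. It ignores quantization constants, the LoRA adapter weights, optimizer state, and activations, all of which consume part of the remaining headroom. The helper name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

params_65b = 65e9
gpu_memory_gb = 48

fp16_gb = weight_memory_gb(params_65b, 16)  # 16-bit baseline
four_bit_gb = weight_memory_gb(params_65b, 4)  # 4-bit quantized base model

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {four_bit_gb:.1f} GB")
print(f"headroom on a {gpu_memory_gb} GB GPU: {gpu_memory_gb - four_bit_gb:.1f} GB")
```

At 16 bits the weights alone come to about 130 GB, far beyond a single 48 GB card. At 4 bits they take about 32.5 GB, which leaves roughly 15.5 GB for everything else.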
Topics
- Knowledge Distillation
- Large Language Models
- Model Compression
- Synthetic Data
- Retrieval-Augmented Generation
- AI Model Risks
Best for: AI Architect, MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Scientist, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Pascal’s Substack.