There are at least ten distinct technical families of teacher→student transfer, not one monolithic “distillation.”

2025-11-28 · Source: Pascal’s Substack · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, long

Summary

Teacher–student transfer in AI spans at least ten distinct technical families rather than a single monolithic "distillation," all aimed at letting cheaper student models learn from expensive frontier teachers. The methods range from simple RAG and prompting to logit-level distillation and speculative decoding, and they are driven primarily by economics: inference costs for GPT-3.5-level systems fell more than 280-fold, from ~$20 to ~$0.07 per million tokens, between November 2022 and October 2024. The transfer is effective on narrow tasks but partial: style is copied reliably, deep reasoning only unevenly. Rapid adoption of these techniques has compressed the capability gap between the leading closed-weight and the best open-weight models on the Chatbot Arena Leaderboard from 8.04% in January 2024 to 1.70% in February 2025. Significant risks remain, including the transfer of hallucinations and bias, "subliminal learning," and "model collapse" from recursive training on synthetic data, and the legality of API-based distillation is contested.
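
To make "logit-level distillation" concrete, here is a minimal sketch assuming the classic temperature-softened KL-divergence formulation. The summary does not say which loss the original article describes, so the function names, the default temperature, and the example logits below are illustrative assumptions, not the author's method.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions, scaled by T^2.

    Assumption: the T^2 factor follows the common convention of keeping
    gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

# Hypothetical logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.0, 0.5]
loss_mismatch = distillation_loss(teacher, student)
loss_match = distillation_loss(teacher, teacher)
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise. In practice this term is usually combined with an ordinary cross-entropy loss on ground-truth labels, and it requires access to the teacher's logits, which API-only teachers often do not expose.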

Key takeaway

AI scientists and ML engineers evaluating model deployment strategies should prioritize cost-effective teacher-student transfer methods such as RAG and prompt engineering before turning to complex fine-tuning. Validate all distilled models with human evaluation and red-team testing; benchmark gains alone are insufficient given risks like "subliminal learning" and "model collapse." Carefully review API terms of service for training clauses to avoid legal disputes, especially if your task's open-vs-frontier performance gap is minimal.

Key insights

Teacher-student transfer involves diverse methods to imbue smaller models with frontier capabilities, driven by economics but fraught with risks.

Principles

Capability transfer is partial, copying style more than deep reasoning.
Distillation risks include transferring bias, hallucinations, and "subliminal learning."
Recursive training on synthetic data can lead to "model collapse."

In practice

RAG and prompt engineering offer cheap, reversible capability transfer.
QLoRA allows fine-tuning 65B models on a single 48GB GPU.
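
The 65B-on-48GB figure can be sanity-checked with back-of-envelope arithmetic. This is a rough sketch, assuming QLoRA's 4-bit quantization of the frozen base weights and counting weight storage only; quantization constants, LoRA adapter weights, optimizer state, and activations are ignored, so real usage is higher. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(65e9, 16)  # 16-bit baseline
nf4_gb = weight_memory_gb(65e9, 4)    # 4-bit quantized base weights
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the base weights take about 32.5 GB at 4 bits versus about 130 GB at 16 bits, which is why a 65B model can fit on a 48 GB card only once the base weights are quantized.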

Topics

Knowledge Distillation
Large Language Models
Model Compression
Synthetic Data
Retrieval-Augmented Generation
AI Model Risks

Code references

tatsu-lab/stanford_alpaca

Best for: AI Architect, MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Scientist, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Pascal’s Substack.