KD4MT: A Survey of Knowledge Distillation for Machine Translation
Summary
This survey synthesizes 105 papers on Knowledge Distillation (KD) in Machine Translation (MT) published through October 1, 2025. It argues that KD in MT is more than model compression: it acts as a general-purpose knowledge transfer mechanism that shapes supervision, translation quality, and efficiency. The survey introduces MT and KD fundamentals, categorizes KD4MT advances by methodological contribution and practical application, and identifies trends, research gaps, and the lack of unified evaluation practices. It also offers practical guidelines for selecting a KD method, discusses risks such as increased hallucination and bias amplification, and examines the evolving role of Large Language Models (LLMs) in KD4MT. A public database and a glossary accompany the survey.
Key takeaway
If you develop or deploy Machine Translation systems, treat Knowledge Distillation as more than a compression tool. Apply it to improve translation quality, expand language coverage, or adapt models to specific domains, particularly under resource constraints or when specializing general-purpose LLMs for MT. Watch for risks such as increased hallucination and bias amplification when you implement KD.
Key insights
KD in Machine Translation is a versatile knowledge transfer mechanism, not solely a compression technique.
Principles
- KD can adapt models to specific tasks and domains.
- KD can merge information from multiple models.
- KD can compensate for data scarcity.
Method
KD first trains a strong "teacher" model, then trains a (typically smaller) "student" model under the teacher's supervision by minimizing the divergence between the two models' output distributions.
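To make "minimizing divergence between output distributions" concrete, here is a minimal sketch of a token-level distillation loss in plain Python. It is an illustration, not the survey's formulation: the choice of KL(teacher || student), the temperature parameter, and all function names are assumptions introduced for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution, softened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def word_level_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Average per-token KL(teacher || student) over one target sequence.

    Both arguments are lists of per-position logit vectors, one vector per
    target token. The temperature value is an assumed hyperparameter.
    """
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy example: 2 target positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
student = [[1.0, 1.0, 0.0], [0.0, 1.5, 0.5]]
print(word_level_kd_loss(teacher, student))  # positive: distributions differ
print(word_level_kd_loss(teacher, teacher))  # zero: distributions match
```

In a real system this term would be computed on model logits by a deep learning framework and is often combined with the ordinary cross-entropy loss on reference translations; that weighting is left out here to keep the sketch short.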
In practice
- Use Word-Level KD for token-level output distribution matching.
- Consider Sequence-Level KD for full decoded sequence transfer.
- Explore Feature-based KD for intermediate layer knowledge transfer.
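The sequence-level and feature-based options in the list above can be sketched as follows. This is a hedged illustration under simplifying assumptions: `teacher_translate` is a hypothetical callable standing in for the teacher's decoder, the toy lookup table replaces a real model, and mean squared error is one common choice of feature-matching loss rather than the only one.

```python
def build_seq_kd_corpus(sources, teacher_translate):
    """Sequence-level KD: pair each source with the teacher's decoded output.

    `teacher_translate` is a hypothetical callable (str -> str) standing in
    for the teacher's decoding step. The student is then trained on these
    pairs like on any ordinary parallel corpus.
    """
    return [(src, teacher_translate(src)) for src in sources]

def feature_kd_loss(teacher_hidden, student_hidden):
    """Feature-based KD: mean squared error between intermediate features.

    Assumes both vectors already have the same dimension; in practice a
    learned projection is often needed when the student is narrower.
    """
    if len(teacher_hidden) != len(student_hidden):
        raise ValueError("hidden vectors must have the same dimension")
    squared = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(squared) / len(squared)

# Toy teacher: a lookup table instead of a real translation model.
toy_teacher = {"guten Morgen": "good morning", "danke": "thank you"}
corpus = build_seq_kd_corpus(["guten Morgen", "danke"], toy_teacher.get)
print(corpus)

# Toy hidden states for one token position in one layer.
print(feature_kd_loss([0.5, -1.0, 2.0], [0.0, -1.0, 1.0]))
```

Sequence-level KD needs only the teacher's output text, so it works even when teacher logits or internals are unavailable, while feature-based KD requires access to the teacher's intermediate activations.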
Topics
- Knowledge Distillation
- Machine Translation
- Large Language Models
- Response-based KD
- Feature-based KD
Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.