KD4MT: A Survey of Knowledge Distillation for Machine Translation

2026-02-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

A survey titled "KD4MT: A Survey of Knowledge Distillation for Machine Translation" synthesizes 105 papers published through October 1, 2025, on Knowledge Distillation (KD) in Machine Translation (MT). It highlights that KD in MT functions beyond mere model compression, serving as a general-purpose knowledge transfer mechanism that influences supervision, translation quality, and efficiency. The survey introduces MT and KD fundamentals, categorizes KD4MT advances by methodological contributions and practical applications, and identifies trends, research gaps, and the absence of unified evaluation practices. It also provides practical guidelines for KD method selection, discusses risks like increased hallucination and bias amplification, and explores the evolving role of Large Language Models (LLMs) in KD4MT. A public database and glossary complement the survey.

Key takeaway

AI scientists and research scientists who develop or deploy Machine Translation systems should treat Knowledge Distillation as more than a model compression tool. Apply KD deliberately to improve translation quality, expand language coverage, or adapt models to specific domains, especially under resource constraints or when specializing general-purpose LLMs for MT tasks. Watch for risks such as increased hallucination and bias amplification when implementing KD.

Key insights

KD in Machine Translation is a versatile knowledge transfer mechanism, not solely a compression technique.

Principles

KD can adapt models to specific tasks and domains.
KD can merge information from multiple models.
KD can compensate for data scarcity.

Method

KD first trains a strong "teacher" model, then trains a (typically smaller) "student" model under the teacher's supervision by minimizing the divergence between the student's and the teacher's output distributions.
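
As a rough illustration of "minimizing divergence between output distributions", the sketch below computes a KL divergence between a teacher's and a student's distribution over a toy vocabulary. The temperature parameter, the direction KL(teacher || student), and all function names are assumptions made for this example; the summary does not state which divergence or settings the surveyed papers use.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed objective: KL(teacher || student) on temperature-softened outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, q_student)

# Toy logits over a 3-token vocabulary (illustrative values only).
teacher = [2.0, 0.0, -1.0]
student = [0.0, 0.0, 0.0]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the quantity the student's training would drive down.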

In practice

Use Word-Level KD to match the teacher's output distribution at each target token.
Consider Sequence-Level KD to train the student on the teacher's full decoded sequences.
Explore Feature-based KD to transfer knowledge from the teacher's intermediate layers.
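
The three options above differ mainly in what the student is asked to imitate. The following is a minimal sketch of each signal in plain Python, not an implementation from the survey. The function names, the mean-squared-error choice for features, and the `teacher_translate` callable are hypothetical, and the feature loss assumes teacher and student hidden vectors have the same dimension.

```python
import math

def word_level_kd_loss(teacher_probs, student_probs, eps=1e-12):
    """Token-level signal: average cross-entropy of the student against the
    teacher's distribution at each target position.
    Each argument is a list of positions; each position is a list of
    probabilities over the same vocabulary."""
    total = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        total += -sum(
            pt * math.log(ps + eps) for pt, ps in zip(p_teacher, p_student)
        )
    return total / len(teacher_probs)

def build_sequence_level_corpus(sources, teacher_translate):
    """Sequence-level signal: pair each source sentence with the teacher's
    decoded translation, so the student can train on these pairs.
    `teacher_translate` is a hypothetical callable standing in for decoding."""
    return [(src, teacher_translate(src)) for src in sources]

def feature_kd_loss(teacher_hidden, student_hidden):
    """Feature-based signal: mean squared error between one teacher and one
    student hidden vector (assumed to have equal length)."""
    n = len(teacher_hidden)
    return sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / n

# Toy usage with made-up values.
wl = word_level_kd_loss([[1.0, 0.0]], [[0.5, 0.5]])
corpus = build_sequence_level_corpus(["guten tag", "danke"], str.upper)
fl = feature_kd_loss([1.0, 3.0], [0.0, 1.0])
```

In this toy run `str.upper` merely stands in for a teacher's decoder. In a real pipeline the first loss would be computed from model outputs during training, the second function would run once to build the student's training data, and the third would be added as an auxiliary term on chosen layers.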

Topics

Knowledge Distillation
Machine Translation
Large Language Models
Response-based KD
Feature-based KD

Code references

Helsinki-NLP/KD4MT-survey

Best for: AI Scientist, Research Scientist, AI Researcher, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.