Wasserstein Distance is Just Moving Dirt
Summary
The article introduces Wasserstein distance as a way to compare probability distributions that avoids a key limitation of classical measures such as KL divergence. KL divergence becomes infinite whenever one distribution places mass where the other has none, so two nearly identical distributions with non-overlapping supports look infinitely far apart and the measure provides no useful gradient. Wasserstein distance instead treats distributions as "piles of dirt" and measures the minimum "work" (mass times distance) needed to reshape one pile into the other. This work is formalized through a "coupling": a joint distribution whose marginals are the two distributions being compared, which acts as a transport plan. The Wasserstein distance is the infimum of the average transport cost over all couplings. Because the distance changes gradually as one distribution shifts relative to the other, it yields a finite, informative gradient even for slightly shifted distributions, which makes it useful for applications such as training Generative Adversarial Networks (GANs).
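The coupling definition described in the summary can be written compactly. The notation below is an assumption for illustration and is not taken from the original article: \mu and \nu are the two distributions, \Pi(\mu, \nu) is the set of all couplings (joint distributions with marginals \mu and \nu), and the cost is plain distance, which corresponds to the 1-Wasserstein case.

```latex
% 1-Wasserstein distance: cheapest average transport cost over all couplings
W_1(\mu, \nu) \;=\; \inf_{\gamma \in \Pi(\mu, \nu)} \; \mathbb{E}_{(x, y) \sim \gamma} \bigl[ \lVert x - y \rVert \bigr]

% For comparison, KL divergence between densities p and q;
% it is infinite as soon as p > 0 somewhere that q = 0
D_{\mathrm{KL}}(P \,\Vert\, Q) \;=\; \int p(x) \, \log \frac{p(x)}{q(x)} \, dx
```

Each \gamma says how much mass travels from x to y; the expectation is the "mass times distance" work of that plan.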
Key takeaway
For Machine Learning Engineers training generative models and Data Scientists comparing distributions with little or no overlap, Wasserstein distance is worth understanding. It gives stable, meaningful gradients in cases where measures like KL divergence become infinite or uninformative, which helps avoid stalled training and misleading similarity assessments. Consider using it as a training objective or evaluation metric, especially for GANs, when you need more robust training behavior.
Key insights
Wasserstein distance measures the minimum work to transform one distribution into another, providing useful gradients where classical metrics fail.
Principles
- Classical measures such as KL divergence saturate or blow up for non-overlapping distributions.
- Wasserstein distance provides a smooth, finite gradient.
- Distance can be about deforming geometries, not pointwise overlap.
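The principles above can be checked numerically with a small sketch. It assumes a toy setup that is not from the original article: two point masses on a 1-D integer grid, one fixed at position 0 and one moved further away. Jensen-Shannon divergence is included as an extra example of a classical measure that saturates; the helper names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same grid."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with zero mass in p contribute nothing
        if qi == 0.0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1d(p, q, spacing=1.0):
    """W1 on a uniform 1-D grid, computed as the area between the two CDFs."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap) * spacing
    return total

def point_mass(index, size=10):
    """All probability mass on a single grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]

base = point_mass(0)
for shift in (1, 2, 5):
    moved = point_mass(shift)
    print(shift,
          kl_divergence(base, moved),
          js_divergence(base, moved),
          wasserstein_1d(base, moved))
```

For every shift, KL is infinite and JS stays at log 2, so neither says how far apart the piles are. The Wasserstein value equals the shift itself, which is what gives an optimizer a direction to move in.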
Method
Formalize a transport plan as a coupling: a joint distribution whose marginals are the two distributions being compared. For each plan, compute the average transport cost (mass moved times distance travelled). The Wasserstein distance is the infimum of this cost over all plans.
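The method can be sketched on a tiny example. The code below is an illustration under stated assumptions, not the article's implementation: it works only for 1-D discrete distributions with sorted support points, where filling demand from left to right (the monotone coupling) is known to be optimal for the distance cost. In higher dimensions the minimization is a linear program and needs a solver. All function names and the sample masses are hypothetical.

```python
def monotone_coupling(xs, p, ys, q):
    """Build a transport plan between two 1-D discrete distributions.

    xs, ys: sorted support points; p, q: their masses (each summing to 1).
    Returns a list of (source, target, mass) moves.
    """
    plan = []
    i = j = 0
    supply, demand = p[0], q[0]
    while i < len(xs) and j < len(ys):
        moved = min(supply, demand)
        if moved > 0:
            plan.append((xs[i], ys[j], moved))
        supply -= moved
        demand -= moved
        if supply <= 1e-12:  # current source pile is empty, go to the next
            i += 1
            supply = p[i] if i < len(xs) else 0.0
        if demand <= 1e-12:  # current target hole is full, go to the next
            j += 1
            demand = q[j] if j < len(ys) else 0.0
    return plan

def independent_coupling(xs, p, ys, q):
    """A valid but usually wasteful plan: spread every pile over every target."""
    return [(x, y, pi * qj) for x, pi in zip(xs, p) for y, qj in zip(ys, q)]

def transport_cost(plan):
    """Average transport cost of a plan: sum of mass times distance."""
    return sum(mass * abs(x - y) for x, y, mass in plan)

xs, p = [0.0, 1.0, 2.0], [0.5, 0.25, 0.25]
ys, q = [1.0, 3.0], [0.5, 0.5]

optimal_cost = transport_cost(monotone_coupling(xs, p, ys, q))
independent_cost = transport_cost(independent_coupling(xs, p, ys, q))
print(optimal_cost, independent_cost)
```

Both plans are couplings with the right marginals, but they cost different amounts of work. The Wasserstein distance is the cost of the cheapest one, which is why the definition takes an infimum over all plans.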
In practice
- Use Wasserstein distance for stable GAN training.
- Apply when distributions don't perfectly overlap.
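When only samples are available, a rough estimate is often enough to apply the points above. The sketch below assumes 1-D data and two sample sets of equal size, in which case the 1-Wasserstein distance between the empirical distributions is the mean absolute difference of the sorted samples. The function name and the Gaussian test data are assumptions for illustration; this is not a GAN training loop, only a way to compare sample sets.

```python
import random

def empirical_w1(a, b):
    """W1 between two equal-size 1-D sample sets via sorted matching."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    matched = zip(sorted(a), sorted(b))
    return sum(abs(x - y) for x, y in matched) / len(a)

rng = random.Random(0)  # fixed seed so runs are repeatable
real = [rng.gauss(0.0, 1.0) for _ in range(2000)]

estimates = {}
for offset in (0.0, 0.5, 1.0, 4.0):
    # Independent samples from a unit Gaussian centred at `offset`
    fake = [rng.gauss(offset, 1.0) for _ in range(2000)]
    estimates[offset] = empirical_w1(real, fake)
    print(offset, estimates[offset])
```

The estimate should track the offset closely, including at an offset of 4.0 where the two sample sets barely overlap, so it stays informative in exactly the regime where overlap-based measures stop being useful.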
Topics
- Wasserstein Distance
- KL Divergence
- Probability Distributions
- Generative Adversarial Networks
- Machine Learning Metrics
- Optimal Transport
Best for: AI Scientist, Research Scientist, AI Engineer, AI Student, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by DataMListic.