DOT-MoE: Differentiable Optimal Transport for MoEfication
Summary
DOT-MoE is a framework for converting pre-trained dense Large Language Models (LLMs) into sparse Mixture of Experts (MoE) architectures, which addresses the inference cost of scaling LLMs. Existing methods partition the Feed-Forward Network (FFN) by heuristic neuron clustering or random splitting; DOT-MoE instead formulates the decomposition as a Differentiable Optimal Transport (DOT) problem. It uses differentiable Sinkhorn-Knopp iterations to assign neurons to experts while enforcing strict expert capacity constraints, and Straight-Through Estimators (STE) to learn the discrete neuron-to-expert assignment and the token-to-expert routing policy jointly, end-to-end. In the reported experiments, DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
Key takeaway
For Machine Learning Engineers optimizing LLM inference efficiency, DOT-MoE offers a way to convert pre-trained dense models into sparse MoEs. The reported results show a 50% reduction in active parameters while retaining 90% of the original model's performance. Consider this Differentiable Optimal Transport approach when the goal is to lower the compute cost of large-scale LLM deployments.
Key insights
To make inference more efficient, DOT-MoE converts dense LLMs into sparse MoEs by framing the partitioning of FFN neurons into experts as a Differentiable Optimal Transport problem.
Principles
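- Converting a dense LLM to a sparse MoE can improve inference efficiency, because only a subset of experts is active for each token.
- Optimal Transport can model neuron assignment: neurons are the mass to be moved, and experts are destinations with a fixed capacity.
- Learning the discrete neuron-to-expert assignment and the token-to-expert routing jointly, end-to-end, works better than fixing the partition heuristically in advance.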
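To make the Optimal Transport view concrete, one standard way to write a capacity-constrained assignment is as an entropy-regularized transport problem. The notation here is an assumption for illustration and is not taken from the original article: N neurons, E experts, a cost matrix C (for example, a negated neuron-to-expert affinity), a transport plan P, and a regularization strength ε.

```latex
\min_{P \in \mathbb{R}_{\ge 0}^{N \times E}} \; \langle C, P \rangle - \varepsilon H(P)
\quad \text{s.t.} \quad
P \mathbf{1}_E = \mathbf{1}_N, \qquad
P^{\top} \mathbf{1}_N = \tfrac{N}{E} \mathbf{1}_E,
\qquad
H(P) = -\sum_{i=1}^{N} \sum_{j=1}^{E} P_{ij} \left( \log P_{ij} - 1 \right)
```

The row constraint gives every neuron one unit of mass to place, and the column constraint caps every expert at N/E neurons. Sinkhorn-Knopp iterations solve this regularized problem by alternately rescaling rows and columns of exp(-C/ε).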
Method
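Formulate the decomposition of each dense FFN layer into experts as a Differentiable Optimal Transport problem, and solve it with differentiable Sinkhorn-Knopp iterations that assign neurons to experts under strict capacity constraints. Use Straight-Through Estimators to learn the discrete neuron-to-expert assignment and the token-to-expert routing policy jointly.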
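The following is a minimal sketch of capacity-constrained neuron assignment with Sinkhorn-Knopp iterations, written in plain Python so it runs without dependencies. It is not the authors' implementation. The function names (`sinkhorn_plan`, `round_to_capacity`), the affinity matrix `scores`, the temperature `tau`, and the greedy rounding step are all assumptions made for illustration; the summary does not say how the paper defines affinities or discretizes the plan.

```python
import math


def sinkhorn_plan(scores, n_iters=200, tau=0.5):
    """Soft neuron-to-expert plan with balanced expert loads.

    scores[i][j] is a hypothetical affinity of neuron i for expert j.
    Rows of the result sum to 1 (one unit of mass per neuron); columns
    converge to n_neurons / n_experts (the per-expert capacity).
    """
    n, k = len(scores), len(scores[0])
    capacity = n / k
    # Gibbs kernel exp(score / tau); subtracting the row max avoids overflow
    # and only rescales rows, which Sinkhorn normalizes away anyway.
    plan = []
    for row in scores:
        row_max = max(row)
        plan.append([math.exp((s - row_max) / tau) for s in row])
    for _ in range(n_iters):
        # Column step: rescale each expert's column to hold `capacity` mass.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(k)]
        plan = [[row[j] * capacity / col_sums[j] for j in range(k)] for row in plan]
        # Row step: rescale each neuron's row to sum to 1.
        normalized = []
        for row in plan:
            total = sum(row)
            normalized.append([p / total for p in row])
        plan = normalized
    return plan


def round_to_capacity(plan):
    """Greedy rounding (an assumption, not from the source): take the
    largest plan entries first, skipping neurons that are already placed
    and experts that are already full. Assumes n_neurons % n_experts == 0."""
    n, k = len(plan), len(plan[0])
    capacity = n // k
    pairs = sorted(
        ((plan[i][j], i, j) for i in range(n) for j in range(k)), reverse=True
    )
    assignment = [None] * n
    load = [0] * k
    for _, i, j in pairs:
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment


# Six neurons, two experts. A plain argmax over rows would send five
# neurons to expert 0, which violates a capacity of three per expert.
scores = [
    [0.9, 0.1],
    [0.8, 0.3],
    [0.7, 0.2],
    [0.6, 0.5],
    [0.2, 0.9],
    [0.55, 0.5],
]
plan = sinkhorn_plan(scores)
assignment = round_to_capacity(plan)
print("expert loads:", [assignment.count(j) for j in range(2)])
```

In an autodiff framework, a straight-through estimator would use the hard assignment in the forward pass and pass gradients through the soft plan in the backward pass, which is what lets the discrete partition be trained end-to-end. This sketch only covers the forward computation.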
In practice
- Convert existing pre-trained dense LLMs into sparse MoE models.
- Reduce active parameters at inference by 50%.
- Retain 90% of the original dense model's performance.
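As a rough illustration of how a 50% cut in active parameters can arise, the arithmetic below splits one FFN layer into equal experts and activates half of them per token. All sizes here (`d_model`, `d_ff`, eight experts, top-4 routing, a linear router) are hypothetical; the summary does not state the paper's configuration or whether the 50% figure refers to FFN parameters only.

```python
def ffn_param_counts(d_model, d_ff, n_experts, top_k):
    """Dense vs. active parameter counts for one two-matrix FFN layer.

    Assumes the d_ff hidden neurons are split evenly across experts and
    that a linear router of shape (d_model, n_experts) is added.
    """
    dense = 2 * d_model * d_ff                        # up- and down-projection
    neurons_per_expert = d_ff // n_experts
    active_experts = 2 * d_model * neurons_per_expert * top_k
    router = d_model * n_experts
    return dense, active_experts + router


dense, active = ffn_param_counts(d_model=4096, d_ff=16384, n_experts=8, top_k=4)
print(f"dense FFN params:  {dense:,}")
print(f"active FFN params: {active:,}")
print(f"active fraction:   {active / dense:.4f}")
```

Under these assumptions the router adds only a small overhead, so the active fraction is close to top_k / n_experts.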
Topics
- Mixture of Experts
- Large Language Models
- Differentiable Optimal Transport
- Inference Efficiency
- Model Sparsity
- Straight-Through Estimators
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.