DOT-MoE: Differentiable Optimal Transport for MoEfication

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

DOT-MoE is a framework for converting pre-trained dense Large Language Models (LLMs) into sparse Mixture of Experts (MoE) architectures, targeting the inference cost of scaling LLMs. Existing methods partition Feed-Forward Network (FFN) neurons by heuristic clustering or random splitting; DOT-MoE instead formulates the decomposition as a Differentiable Optimal Transport (DOT) problem. Differentiable Sinkhorn-Knopp iterations produce the neuron assignment and enforce strict expert capacity constraints. Straight-Through Estimators (STE) let the discrete neuron-to-expert assignment and the token-to-expert routing policy be learned jointly, end-to-end. In the reported experiments, DOT-MoE outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.

Key takeaway

For Machine Learning Engineers optimizing LLM inference efficiency, DOT-MoE offers a way to convert pre-trained dense models into sparse MoEs. The reported result is a 50% reduction in active parameters while retaining 90% of the original model's performance. Consider this Differentiable Optimal Transport approach when you need to lower compute costs for large-scale LLM deployments.

Key insights

DOT-MoE converts dense LLMs to sparse MoEs by framing neuron decomposition as a Differentiable Optimal Transport problem for efficient inference.

Principles

MoE conversion can improve LLM inference efficiency.
Optimal Transport can model neuron assignment.
Joint learning improves discrete assignment and routing.
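
To make the second principle concrete, here is one common way to write neuron assignment as an entropy-regularized optimal transport problem. This is a sketch under assumptions the summary does not confirm: N neurons, E equal-capacity experts, a cost matrix C (how it is built or learned is not stated here), and a regularization strength ε are all illustrative symbols.

```latex
% P_{ij}: (soft) mass of neuron i assigned to expert j
% C_{ij}: assumed cost of placing neuron i in expert j
\min_{P \in \mathbb{R}_{\ge 0}^{N \times E}}
  \; \langle P, C \rangle - \varepsilon H(P)
\quad \text{s.t.} \quad
  P \mathbf{1}_E = \mathbf{1}_N ,
  \qquad
  P^{\top} \mathbf{1}_N = \tfrac{N}{E} \, \mathbf{1}_E ,
\qquad
H(P) = -\sum_{i,j} P_{ij} \left( \log P_{ij} - 1 \right).

% The minimizer has the form P = \mathrm{diag}(u) \, K \, \mathrm{diag}(v)
% with K = \exp(-C / \varepsilon); Sinkhorn-Knopp alternately rescales
% u and v until both marginal constraints hold.
```

The row constraint says every neuron is assigned exactly once; the column constraint is the capacity constraint, giving each expert the same share of neurons.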

Method

Formulate the decomposition of dense FFN layers as a Differentiable Optimal Transport problem, using Sinkhorn-Knopp iterations to assign neurons to experts under expert capacity constraints. Jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing via Straight-Through Estimators.
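
A minimal sketch of the Sinkhorn-Knopp step may help here. It is not the authors' implementation: the function names, the toy cost matrix, the value of epsilon, and the greedy rounding step are all assumptions made for illustration, and plain Python stands in for an autograd framework.

```python
import math


def sinkhorn_plan(cost, capacity, epsilon=0.5, n_iters=200):
    """Soft neuron-to-expert plan: rows sum to 1, columns sum to `capacity`.

    cost[i][j] is a hypothetical cost of placing neuron i in expert j.
    Assumes len(cost) == num_experts * capacity so the problem is feasible.
    """
    n, e = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-C / epsilon)
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    v = [1.0] * e
    for _ in range(n_iters):
        # Row scaling: each neuron carries total mass 1.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(e)) for i in range(n)]
        # Column scaling: each expert receives exactly `capacity` mass.
        v = [capacity / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(e)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(e)] for i in range(n)]


def round_with_capacity(plan, capacity):
    """Greedy rounding (an assumption, not from the source): take the largest
    plan entries first, skipping experts that are already full."""
    n, e = len(plan), len(plan[0])
    pairs = sorted(
        ((plan[i][j], i, j) for i in range(n) for j in range(e)), reverse=True
    )
    assignment = [None] * n
    load = [0] * e
    for _, i, j in pairs:
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment


# Toy example: 4 neurons, 2 experts, capacity 2. Three neurons prefer
# expert 0, so the capacity constraint has to push one of them to expert 1.
cost = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [1.0, 0.0]]
plan = sinkhorn_plan(cost, capacity=2)
assignment = round_with_capacity(plan, capacity=2)
print(assignment)
```

In an autograd framework, the usual straight-through trick is to use the hard one-hot assignment in the forward pass while letting gradients flow through the soft plan, for example `soft + stop_gradient(hard - soft)`. Whether DOT-MoE rounds or applies STE in exactly this way is not stated in this summary.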

In practice

Convert existing dense LLMs to MoE.
Reduce active parameters by 50% for inference.
Retain 90% of the original dense model's performance.
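
To see what a 50% cut in active parameters can mean in practice, the arithmetic below may be useful. All numbers are hypothetical: the summary does not give expert counts, the number of experts routed per token, or whether the 50% figure counts FFN parameters only or the whole model. The sketch also ignores the small router parameters.

```python
def active_fraction(ffn_params, other_params, num_experts, top_k):
    """Fraction of parameters used per token when the FFN is split into
    `num_experts` equal experts and each token is routed to `top_k` of them.

    `other_params` (attention, embeddings, ...) are assumed always active.
    """
    active_ffn = ffn_params * top_k / num_experts
    return (active_ffn + other_params) / (ffn_params + other_params)


# Case 1 (hypothetical): counting FFN parameters only, 4 of 8 experts active.
ffn_only = active_fraction(ffn_params=1.0, other_params=0.0, num_experts=8, top_k=4)

# Case 2 (hypothetical): FFN holds 2/3 of all parameters, 2 of 8 experts active.
whole_model = active_fraction(ffn_params=2.0, other_params=1.0, num_experts=8, top_k=2)

print(ffn_only, whole_model)
```

Both hypothetical configurations land on half the parameters being active, which shows why the counting convention matters when comparing the 50% figure across methods.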

Topics

Mixture of Experts
Large Language Models
Differentiable Optimal Transport
Inference Efficiency
Model Sparsity
Straight-Through Estimators

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.