SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SoftMoE is a Mixture-of-Experts (MoE) architecture for Large Language Models that addresses the limitations of discrete top-$k$ routing. It replaces the hard selection with a truncated soft top-$k$ relaxation based on LapSum, so expert routing can be optimized with gradients. The model also parameterizes the mean number of active experts in each layer and enforces a global budget constraint, which lets it learn how to allocate expert capacity across layers. SoftMoE remains compatible with autoregressive modeling and matches or exceeds sparse MoE on language modeling and downstream tasks while activating significantly fewer experts. A notable finding is that the learned allocation is highly non-uniform, with later layers consistently activating more experts.

Key takeaway

For machine learning engineers optimizing Mixture-of-Experts Large Language Models, SoftMoE offers a way to match or exceed sparse MoE performance with fewer active experts. Consider adopting its differentiable routing and learned per-layer expert allocation to reduce compute, and check whether your own models benefit from assigning more experts to later layers.

Key insights

SoftMoE enables differentiable expert routing and learned capacity allocation in MoE LLMs, activating fewer experts for comparable performance.

Principles

Differentiable routing optimizes expert allocation.
Expert capacity can be learned across layers.
Later layers may require more experts.

Method

SoftMoE replaces discrete top-$k$ routing with a truncated soft top-$k$ LapSum relaxation. It parameterizes the mean active experts per layer and applies a global budget constraint for gradient-based optimization.
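
The soft top-$k$ idea in the method above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each expert's weight is a Laplace CDF of its score relative to a shared threshold, and the threshold is chosen so the weights sum to $k$. The function names, the `alpha` smoothing scale, and the `eps` cutoff are hypothetical. Bisection is a stand-in chosen for brevity rather than the LapSum solver, and zeroing weights below `eps` is only one guess at what "truncated" could mean, since the summary does not define it.

```python
import math


def laplace_cdf(x: float) -> float:
    """CDF of the standard Laplace distribution (location 0, scale 1)."""
    return 0.5 * math.exp(x) if x < 0 else 1.0 - 0.5 * math.exp(-x)


def soft_top_k(scores, k, alpha=0.1, eps=0.0, iters=100):
    """Soft selection weights in [0, 1] whose sum is (approximately) k.

    scores: router scores, one per expert.
    k:      target mean number of active experts; may be fractional.
    alpha:  smoothing scale (assumed name); smaller is closer to hard top-k.
    eps:    assumed truncation cutoff; weights below it are set to zero.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must lie strictly between 0 and the number of experts")

    def total(b):
        # Sum of weights for threshold b; decreases monotonically as b grows.
        return sum(laplace_cdf((s - b) / alpha) for s in scores)

    # Bracket the threshold: total(lo) is about len(scores), total(hi) about 0.
    lo = min(scores) - 40.0 * alpha
    hi = max(scores) + 40.0 * alpha
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > k:
            lo = mid  # too many experts selected, raise the threshold
        else:
            hi = mid
    b = 0.5 * (lo + hi)

    weights = [laplace_cdf((s - b) / alpha) for s in scores]
    return [w if w >= eps else 0.0 for w in weights]


weights = soft_top_k([2.0, 1.0, 0.5, -1.0], k=2)
```

Because `k` enters only through the target sum, it can take fractional values, which is what makes a learnable mean number of active experts per layer possible. In an automatic-differentiation framework the threshold would also need to be differentiable with respect to the scores and `k`; this sketch does not cover that.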

In practice

Implement the LapSum relaxation for MoE routing.
Explore non-uniform expert allocation in deep models.
Use the source code provided for SoftMoE.
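
To experiment with non-uniform allocation under a fixed budget, one simple option is to derive each layer's mean number of active experts from a learnable logit. The sketch below is an assumption, not the parameterization used by SoftMoE, which this summary does not specify: `allocate_budget`, `k_min`, and the softmax split are all hypothetical. It guarantees that the per-layer values sum to the global budget and that every layer keeps at least `k_min` experts, but it does not cap a layer at the number of experts available.

```python
import math


def allocate_budget(logits, total_budget, k_min=1.0):
    """Split a global expert budget across layers (hypothetical helper).

    logits:       one learnable value per layer; larger means more experts.
    total_budget: required sum of the per-layer mean active experts.
    k_min:        assumed floor on the mean active experts in any layer.
    """
    n_layers = len(logits)
    spare = total_budget - k_min * n_layers
    if spare < 0:
        raise ValueError("total_budget is too small for k_min experts per layer")

    # Numerically stable softmax over layers.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    norm = sum(exps)

    # Each layer gets the floor plus its softmax share of the spare budget.
    return [k_min + spare * e / norm for e in exps]


# Four layers, eight active experts in total, later layers favoured.
per_layer_k = allocate_budget([0.0, 0.0, 1.0, 2.0], total_budget=8.0)
```

Each resulting value can serve as the fractional `k` of a soft top-$k$ router in the corresponding layer, so the split itself can be trained with gradients while the total stays fixed.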

Topics

Mixture-of-Experts
Large Language Models
Differentiable Routing
Expert Allocation
Sparse MoE
Autoregressive Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.