SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SoftMoE is a routing method for Mixture-of-Experts (MoE) architectures in Large Language Models that addresses a core limitation of discrete top-$k$ routing: the expert selection step is not differentiable. SoftMoE replaces it with a truncated soft top-$k$ LapSum relaxation, so expert routing can be optimized with gradients. The model also parameterizes the mean number of active experts in each layer and enforces a global budget constraint across layers, which lets it learn how to allocate expert capacity instead of fixing $k$ by hand. SoftMoE remains compatible with autoregressive modeling and matches or exceeds sparse MoE on language modeling and downstream tasks while activating significantly fewer experts. The learned allocation is highly non-uniform, with later layers consistently activating more experts.

Key takeaway

For machine learning engineers working on Mixture-of-Experts LLMs, SoftMoE offers a way to reach comparable or better performance with fewer active experts. Consider its differentiable routing and learned expert allocation if you want to reduce compute per token, and note that the learned allocation favors later layers, which suggests a uniform per-layer $k$ may not be the best use of a fixed expert budget.

Key insights

SoftMoE enables differentiable expert routing and learned capacity allocation in MoE LLMs, activating fewer experts for comparable performance.
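
To make "learned capacity allocation" concrete, here is a minimal sketch of one way a fixed global expert budget could be split across layers with learnable parameters. The summary does not say how SoftMoE parameterizes the per-layer means or enforces the budget, so everything here is an assumption made for illustration: the function name `allocate_budget`, the softmax parameterization, the floor `k_min`, and the choice to satisfy the budget exactly by construction rather than through a penalty or projection.

```python
import math

def allocate_budget(layer_logits, total_budget, k_min=0.5):
    """Split a global expert budget across layers (hypothetical scheme).

    layer_logits: one learnable scalar per layer (assumed parameterization).
    total_budget: target sum of mean active experts over all layers.
    k_min:        assumed floor so that no layer is starved of experts.
    Returns per-layer mean active-expert counts that sum to total_budget.
    """
    n_layers = len(layer_logits)
    spare = total_budget - k_min * n_layers
    if spare < 0:
        raise ValueError("total_budget is too small for the chosen k_min")
    # Numerically stable softmax over the layer logits.
    m = max(layer_logits)
    exps = [math.exp(t - m) for t in layer_logits]
    z = sum(exps)
    # Every layer gets the floor plus a softmax share of the spare budget.
    return [k_min + spare * e / z for e in exps]

# Four layers, a budget of 8 active experts in total; larger logits on the
# later layers shift capacity toward them.
ks = allocate_budget([0.0, 0.0, 1.0, 2.0], total_budget=8.0)
```

Because the shares come from a softmax, the per-layer values stay differentiable with respect to the logits while their sum stays fixed. A real implementation would also need to keep each value below the number of experts in the layer, which this sketch does not check.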

Method

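SoftMoE replaces discrete top-$k$ routing with a truncated soft top-$k$ LapSum relaxation. It treats the mean number of active experts in each layer as a learnable parameter and constrains the total across layers with a global budget, so both the routing and the capacity allocation can be trained by gradient-based optimization.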
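
The idea behind a soft top-$k$ relaxation can be sketched in a few lines. The version below assumes the Laplace-CDF form of soft top-$k$: each router score $s_i$ gets a weight $p_i = F((s_i - b)/\alpha)$, where $F$ is the standard Laplace CDF and the threshold $b$ is chosen so that the weights sum to $k$. It is not SoftMoE's implementation. The threshold is found by bisection for clarity rather than by a closed-form solve, the truncation step is omitted because the summary does not describe it, and the names `soft_top_k` and `alpha` are illustrative.

```python
import math

def laplace_cdf(z):
    """CDF of the standard Laplace distribution (location 0, scale 1)."""
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)

def soft_top_k(scores, k, alpha=0.1, iters=60):
    """Soft top-k weights in [0, 1] that sum to k (k may be fractional).

    Solves sum_i F((s_i - b) / alpha) = k for the threshold b by bisection.
    A smaller alpha pushes the weights toward a hard 0/1 selection.
    """
    n = len(scores)
    if not 0 < k < n:
        raise ValueError("k must lie strictly between 0 and len(scores)")
    # Bracket the threshold: at lo every weight is ~1, at hi every weight is ~0.
    lo = min(scores) - 50.0 * alpha
    hi = max(scores) + 50.0 * alpha
    for _ in range(iters):
        b = 0.5 * (lo + hi)
        total = sum(laplace_cdf((s - b) / alpha) for s in scores)
        if total > k:
            lo = b  # too much weight selected: raise the threshold
        else:
            hi = b  # too little weight selected: lower the threshold
    b = 0.5 * (lo + hi)
    return [laplace_cdf((s - b) / alpha) for s in scores]

# Router scores for four experts; softly select about two of them.
weights = soft_top_k([2.0, 0.1, 1.5, -0.3], k=2, alpha=0.1)
```

Because $k$ enters only through the constraint on the sum, it does not have to be an integer. That is what makes it possible to treat the mean number of active experts as a continuous quantity that can be learned.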

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.