Mixture of Experts (MoEs) in Transformers
Summary
The Hugging Face blog post "Mixture of Experts (MoEs) in Transformers," published on February 26, 2026, details how the `transformers` library has evolved to support sparse Mixture of Experts (MoE) architectures. MoEs replace dense feed-forward layers with a set of "experts," and a router selects a subset of those experts for each token. This gives a model a large total parameter count while only a small fraction of parameters is active per token, which keeps inference fast. For example, `gpt-oss-20b` has 21B total parameters but uses ~3.6B active parameters per token, achieving ~115 tokens per second. The article highlights three pieces of engineering work: a weight loading refactor that uses `WeightConverter` for dynamic and efficient loading of packed expert tensors, an Experts Backend system for pluggable execution strategies (`eager`, `batched_mm`, `grouped_mm`), and Expert Parallelism for distributing experts across devices. It also notes a collaboration with Unsloth on faster MoE training, with up to a 12x speedup and a 35% VRAM reduction.
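To make the idea of "packed expert tensors" concrete, here is a minimal sketch in plain Python. It is not the `WeightConverter` API: the function `pack_expert_weights`, the key pattern, and the toy checkpoint are all hypothetical, and nested lists stand in for tensors. It only illustrates the general idea of merging one-entry-per-expert checkpoint keys into a single stacked entry.

```python
import re
from collections import defaultdict

# Hypothetical key layout: "<prefix>.experts.<index>.<name>"
EXPERT_KEY = re.compile(r"^(?P<prefix>.+\.experts)\.(?P<idx>\d+)\.(?P<name>.+)$")


def pack_expert_weights(checkpoint):
    """Merge per-expert entries into one list per weight, ordered by expert index."""
    packed = {}
    groups = defaultdict(dict)
    for key, value in checkpoint.items():
        match = EXPERT_KEY.match(key)
        if match is None:
            # Non-expert weights pass through unchanged.
            packed[key] = value
        else:
            groups[(match["prefix"], match["name"])][int(match["idx"])] = value
    for (prefix, name), by_index in groups.items():
        # Stack the experts in index order, regardless of checkpoint order.
        packed[f"{prefix}.{name}"] = [by_index[i] for i in sorted(by_index)]
    return packed


# Toy checkpoint with made-up names and values.
checkpoint = {
    "layers.0.attn.weight": [0.5],
    "layers.0.experts.1.up_proj": [3.0, 4.0],
    "layers.0.experts.0.up_proj": [1.0, 2.0],
}
packed = pack_expert_weights(checkpoint)
```

In this toy layout the two per-expert `up_proj` entries collapse into one `layers.0.experts.up_proj` entry whose first axis is the expert index.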
Key takeaway
For AI engineers deploying or training large language models, understanding MoE architectures and how the `transformers` library supports them is crucial. The weight loading refactor and the Experts Backend system improve inference speed and memory efficiency. Explore `WeightConverter` for optimized model loading, and consider `enable_expert_parallel` for scaling beyond single-GPU limits, especially with models like DeepSeek R1 or Mixtral-8x7B. Unsloth's optimizations can also substantially reduce MoE training time and VRAM usage.
Key insights
MoEs offer high model capacity with reduced inference costs by activating only a subset of experts per token.
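The routing step can be sketched as follows. This is a minimal illustration, not the code of any particular model: `top_k_route` and `moe_layer` are hypothetical names, tokens are scalars, and the softmax is taken over the selected logits only (some models normalize over all experts before selecting, so treat that as an assumption).

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for a single token."""
    return sum(w * experts[i](token) for i, w in top_k_route(router_logits, k))


# Four toy experts: each one just scales its scalar input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routed = top_k_route(logits, k=2)
output = moe_layer(1.0, logits, experts, k=2)
```

Only two of the four experts run for this token; the other two contribute no compute, which is the source of the gap between total and active parameters.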
Principles
- Model capacity scales with total parameters, but inference speed scales with active parameters.
- Expert parallelism distributes experts across devices, so a model too large for one device can be served without increasing the per-token compute.
- Dynamic weight loading and conversion pipelines optimize MoE model initialization.
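The first principle comes down to simple arithmetic. The sketch below uses made-up round numbers, not the configuration of any real model, and a hypothetical helper `moe_param_counts`; it assumes every layer has the same number of equally sized experts and that all non-expert parameters are always active.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts,
                     experts_per_token, num_layers):
    """Return (total, active) parameter counts for a simplified MoE model."""
    # Every expert in every layer counts toward the total.
    total = shared_params + num_layers * num_experts * params_per_expert
    # Only the routed experts in each layer are used for a given token.
    active = shared_params + num_layers * experts_per_token * params_per_expert
    return total, active


# Illustrative values only (assumptions, not taken from the article).
total, active = moe_param_counts(
    shared_params=500_000_000,
    params_per_expert=50_000_000,
    num_experts=16,
    experts_per_token=2,
    num_layers=12,
)
active_fraction = active / total
```

With these toy numbers the model stores about 10.1B parameters but touches about 1.7B per token, roughly one sixth of the total.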
Method
The `transformers` library uses a `WeightConverter` for dynamic weight loading, an Experts Backend for pluggable execution strategies (e.g., `grouped_mm`), and `enable_expert_parallel` for distributing experts across multiple devices.
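The difference between a per-token loop and a grouped strategy can be illustrated with a toy example. This is not the library's implementation of `eager` or `grouped_mm`: the functions `eager_dispatch` and `grouped_dispatch` are hypothetical, each token is routed to a single expert, and a scalar multiply stands in for the expert's matrix multiplication. The point is only that sorting tokens by expert lets each expert process its tokens as one contiguous group.

```python
def eager_dispatch(tokens, expert_ids, expert_weights):
    """One expert call per token, in the original token order."""
    return [expert_weights[e] * t for t, e in zip(tokens, expert_ids)]


def grouped_dispatch(tokens, expert_ids, expert_weights):
    """Sort tokens by expert, run each expert once per group, restore order."""
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    out = [0.0] * len(tokens)
    start = 0
    while start < len(order):
        expert = expert_ids[order[start]]
        end = start
        # Extend the group while the next token goes to the same expert.
        while end < len(order) and expert_ids[order[end]] == expert:
            end += 1
        group = [tokens[i] for i in order[start:end]]
        # Stand-in for a single matmul over the whole group.
        results = [expert_weights[expert] * t for t in group]
        # Scatter the results back to the original token positions.
        for i, r in zip(order[start:end], results):
            out[i] = r
        start = end
    return out


tokens = [1.0, 2.0, 3.0, 4.0, 5.0]
expert_ids = [2, 0, 2, 1, 0]
expert_weights = [10.0, 100.0, 1000.0]
eager_out = eager_dispatch(tokens, expert_ids, expert_weights)
grouped_out = grouped_dispatch(tokens, expert_ids, expert_weights)
```

Both functions return the same values; the grouped version issues three group-level calls instead of five token-level ones, which is the kind of saving a pluggable backend can exploit.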
In practice
- Use `AutoModelForCausalLM.from_pretrained("model_id")` with the `HF_ENABLE_PARALLEL_LOADING` environment variable set for faster MoE loading.
- Employ `DistributedConfig(enable_expert_parallel=True)` for expert parallelism in large MoE models.
- Integrate Unsloth for up to 12x faster MoE training and up to 35% lower VRAM usage.
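As a rough picture of what expert parallelism means for placement, the sketch below assigns contiguous blocks of experts to ranks. It is an assumption about one simple sharding scheme, not a description of how `transformers` partitions experts, and the helpers `experts_for_rank` and `owner_rank` are hypothetical.

```python
def experts_for_rank(num_experts, world_size, rank):
    """Return the expert indices held by one rank under a contiguous split."""
    if num_experts % world_size != 0:
        raise ValueError("num_experts must be divisible by world_size")
    per_rank = num_experts // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def owner_rank(expert_id, num_experts, world_size):
    """Return the rank that holds a given expert under the same split."""
    return expert_id // (num_experts // world_size)


# Eight experts spread over four devices: two experts per device.
placement = {rank: list(experts_for_rank(8, 4, rank)) for rank in range(4)}
rank_for_expert_5 = owner_rank(5, num_experts=8, world_size=4)
```

Each device stores only its own slice of the expert weights, so tokens routed to an expert on another device have to be sent there and the results returned.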
Topics
- Mixture-of-Experts
- Transformers Library
- Weight Loading
- Expert Parallelism
- MoE Training
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.