Mixture of Experts (MoEs) in Transformers

2026-02-25 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

The Hugging Face blog post, "Mixture of Experts (MoEs) in Transformers," published on February 26, 2026, details how the `transformers` library has evolved to support sparse Mixture of Experts (MoE) architectures. MoEs replace dense feed-forward layers with a set of "experts," and a router selects a subset of them for each token. A model can therefore have a large total parameter count while only a small fraction is active per token, which keeps inference fast. For example, `gpt-oss-20b` has 21B total parameters but uses about 3.6B active parameters per token, and the post reports roughly 115 tokens per second. The article highlights three pieces of engineering work: a weight-loading refactor that uses `WeightConverter` to load packed expert tensors dynamically and efficiently, an Experts Backend system for pluggable execution strategies (`eager`, `batched_mm`, `grouped_mm`), and Expert Parallelism for distributing experts across devices. It also notes a collaboration with Unsloth on faster MoE training, with up to 12x speedup and 35% VRAM reduction.

Key takeaway

AI engineers who deploy or train large language models benefit from understanding MoE architectures and how the `transformers` library supports them. The weight-loading refactor and the Experts Backend system improve inference speed and memory efficiency. Explore `WeightConverter` for optimized model loading, and consider `enable_expert_parallel` when a model no longer fits on a single GPU, especially with models like DeepSeek R1 or Mixtral-8x7B. Unsloth's optimizations can also cut MoE training time and VRAM usage substantially.

Key insights

MoEs offer high model capacity with reduced inference costs by activating only a subset of experts per token.
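
To make the routing idea concrete, here is a minimal pure-Python sketch of top-k routing for a single token. It is an illustration only and not the `transformers` implementation: the function names, the toy scaling "experts," and the choice to renormalize the weights over the selected experts are all assumptions made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized over the selected experts (an assumed convention).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Four toy "experts": each one just scales the token by a constant.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]

selected = route_top_k(router_logits, k=2)
output = moe_forward([1.0, 1.0], router_logits, experts, k=2)
```

Only two of the four experts run for this token, which is the source of the gap between total and active parameters.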

Principles

Model capacity scales with total parameters, but inference speed scales with active parameters.
Expert parallelism distributes experts across devices, so a model can grow beyond single-device memory without increasing the compute spent per token.
Dynamic weight loading and conversion pipelines optimize MoE model initialization.
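
The first two principles can be checked with simple arithmetic. The sketch below uses the `gpt-oss-20b` figures quoted in the summary to compute the active fraction, then shows one possible way to place experts on devices. The expert count, the device count, and the contiguous block placement are illustrative assumptions, not the library's documented strategy.

```python
def active_fraction(total_params, active_params):
    """Share of the model's parameters that one token actually uses."""
    return active_params / total_params

def experts_per_rank(num_experts, world_size):
    """Assign experts to ranks in contiguous blocks (hypothetical scheme)."""
    if num_experts % world_size != 0:
        raise ValueError("num_experts must be divisible by world_size")
    n = num_experts // world_size
    return {rank: list(range(rank * n, (rank + 1) * n)) for rank in range(world_size)}

# Figures quoted in the summary for gpt-oss-20b: 21B total, ~3.6B active.
frac = active_fraction(21e9, 3.6e9)

# Illustrative numbers only: 8 experts spread over 4 devices.
placement = experts_per_rank(num_experts=8, world_size=4)

print(f"active fraction: {frac:.1%}")
print(placement)
```

Each device stores only its own block of experts, while a given token still visits just the few experts the router selects for it.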

Method

The `transformers` library uses a `WeightConverter` for dynamic weight loading, an Experts Backend for pluggable execution strategies (e.g., `grouped_mm`), and `enable_expert_parallel` for distributing experts across multiple devices.
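
Why the execution strategy matters can be shown with a conceptual sketch. It contrasts a naive per-token loop with bucketing tokens by expert, which is the general idea that grouped matrix multiplication exploits. None of this is library code: the function names and the scalar toy experts are hypothetical, and the sketch does not claim to mirror how `eager`, `batched_mm`, or `grouped_mm` are implemented.

```python
from collections import defaultdict

def expert_fn(expert_id, x):
    """Toy expert: multiply the scalar token by (expert_id + 1)."""
    return (expert_id + 1) * x

def per_token_dispatch(tokens, assignments):
    """Handle tokens one at a time, calling each token's experts in turn."""
    return [
        sum(w * expert_fn(e, x) for e, w in routes)
        for x, routes in zip(tokens, assignments)
    ]

def grouped_dispatch(tokens, assignments):
    """Bucket tokens by expert so each expert processes its group in one pass."""
    groups = defaultdict(list)
    for t, routes in enumerate(assignments):
        for e, w in routes:
            groups[e].append((t, w))
    out = [0.0] * len(tokens)
    for e, members in groups.items():
        # One pass per expert over every token routed to it.
        results = [expert_fn(e, tokens[t]) for t, _ in members]
        for (t, w), r in zip(members, results):
            out[t] += w * r  # scatter weighted results back to token order
    return out

tokens = [1.0, 2.0, 3.0]
# Per-token routing decisions as (expert_index, weight) pairs.
assignments = [[(0, 0.5), (1, 0.5)], [(1, 1.0)], [(0, 0.25), (2, 0.75)]]

per_token_out = per_token_dispatch(tokens, assignments)
grouped_out = grouped_dispatch(tokens, assignments)
```

Both functions compute the same values. The difference is the shape of the work: the grouped version issues one larger operation per expert, which suits hardware that prefers fewer, bigger matrix multiplications.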

In practice

Set the `HF_ENABLE_PARALLEL_LOADING` environment variable before calling `AutoModelForCausalLM.from_pretrained("model_id")` for faster MoE loading.
Employ `DistributedConfig(enable_expert_parallel=True)` for expert parallelism in large MoE models.
Integrate Unsloth for up to 12x faster MoE training and up to 35% lower VRAM usage.
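
The first two items might be combined roughly as follows. This is a hedged sketch rather than a verified recipe: the identifiers `HF_ENABLE_PARALLEL_LOADING`, `AutoModelForCausalLM.from_pretrained`, and `DistributedConfig(enable_expert_parallel=True)` come from the summary above, while the value `"true"`, the import path `transformers.distributed`, the `distributed_config` keyword, and the helper name are assumptions to check against the documentation for your installed version.

```python
import os

# Set before any weights are loaded. The value "true" is an assumption here.
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "true"

def load_moe_with_expert_parallel(model_id):
    """Hypothetical helper: load a MoE checkpoint with expert parallelism."""
    # Imported lazily so this file can be read without transformers installed.
    from transformers import AutoModelForCausalLM
    from transformers.distributed import DistributedConfig  # assumed import path

    distributed_config = DistributedConfig(enable_expert_parallel=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        distributed_config=distributed_config,  # assumed keyword name
    )

# Example call (not executed here; it needs a multi-GPU launch and a download):
# model = load_moe_with_expert_parallel("model_id")
```

Expert parallelism only helps when the process is launched across several devices, so a script like this would normally be started with a distributed launcher instead of plain `python`.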

Topics

Mixture-of-Experts
Transformers Library
Weight Loading
Expert Parallelism
MoE Training

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.