Mixture of Experts (MoEs) in Transformers

Source: Hugging Face - Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, extended

Summary

The Hugging Face blog post "Mixture of Experts (MoEs) in Transformers," published on February 26, 2026, describes how the `transformers` library has evolved to support sparse Mixture of Experts (MoE) architectures. MoEs replace dense feed-forward layers with a set of "experts," and a router selects a subset of those experts for each token. A model can therefore hold a large total parameter count while using only a small fraction of those parameters per token, which speeds up inference. For example, `gpt-oss-20b` has 21B total parameters but uses ~3.6B active parameters per token, achieving ~115 tokens per second. The article highlights three pieces of engineering work: a weight loading refactor that uses `WeightConverter` for dynamic, efficient loading of packed expert tensors; an Experts Backend system with pluggable execution strategies (`eager`, `batched_mm`, `grouped_mm`); and Expert Parallelism for distributing experts across devices. It also notes a collaboration with Unsloth on faster MoE training, with up to 12x speedup and 35% VRAM reduction.
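The routing step described above can be illustrated with a small, library-free sketch. It assumes top-k routing with a softmax over router logits and renormalized weights for the selected experts; real models differ in these details, and the toy experts and the names `route_top_k` and `moe_layer` are hypothetical, not part of `transformers`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Four toy "experts": each scales the token by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # experts 1 and 3 share the top score

selected = route_top_k(router_logits, k=2)
output = moe_layer(token, router_logits, experts, k=2)
```

Only two of the four experts are evaluated for this token, which is the source of the gap between total and active parameters.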

Key takeaway

AI engineers who deploy or train large language models benefit from understanding MoE architectures and how the `transformers` library supports them. The new weight loading refactor and Experts Backend system significantly improve inference speed and memory efficiency. Explore `WeightConverter` for optimized model loading, and consider `enable_expert_parallel` for scaling models beyond single-GPU limits, especially with models like DeepSeek R1 or Mixtral-8x7B. Unsloth's optimizations can also substantially reduce MoE training time and VRAM usage.
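To make "packed expert tensors" concrete, here is a minimal sketch of one plausible packing step: gathering per-expert checkpoint entries into a single stacked entry per weight name. This is not the `WeightConverter` API. The function `pack_expert_weights`, the key layout (`...experts.<index>.<name>`), and the use of nested lists in place of tensors are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical checkpoint key layout: "<prefix>.experts.<index>.<weight name>"
EXPERT_KEY = re.compile(r"^(?P<prefix>.*experts)\.(?P<idx>\d+)\.(?P<name>.+)$")

def pack_expert_weights(state_dict):
    """Stack per-expert entries into one list per weight, ordered by expert index.

    Keys that do not match the expert pattern are copied through unchanged.
    """
    grouped = defaultdict(dict)
    packed = {}
    for key, value in state_dict.items():
        match = EXPERT_KEY.match(key)
        if match is None:
            packed[key] = value
            continue
        packed_key = f"{match['prefix']}.{match['name']}"
        grouped[packed_key][int(match["idx"])] = value
    for packed_key, by_index in grouped.items():
        packed[packed_key] = [by_index[i] for i in sorted(by_index)]
    return packed

# Toy checkpoint with two experts stored out of order (nested lists stand in for tensors).
checkpoint = {
    "layers.0.mlp.router.weight": [[0.1, 0.2]],
    "layers.0.mlp.experts.1.w1": [[3.0]],
    "layers.0.mlp.experts.0.w1": [[1.0]],
}
packed = pack_expert_weights(checkpoint)
```

With a real tensor library, the per-weight list would typically become a single stacked tensor with a leading expert dimension, which is the shape batched and grouped matrix multiplications operate on.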

Key insights

MoEs offer high model capacity with reduced inference costs by activating only a subset of experts per token.
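As a quick check on the scale of that saving, the `gpt-oss-20b` figures quoted in the summary (21B total, ~3.6B active) can be turned into an active fraction. The variable names are illustrative only.

```python
# Figures quoted in the summary for gpt-oss-20b.
total_params = 21e9     # parameters that must be stored
active_params = 3.6e9   # approximate parameters used per token

active_fraction = active_params / total_params
summary = f"{active_fraction:.1%} of parameters are active per token"
print(summary)
```

Roughly one sixth of the weights take part in each token's computation, although all of them still have to be stored somewhere, which is part of why loading and expert placement matter.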

Method

The `transformers` library uses a `WeightConverter` for dynamic weight loading, an Experts Backend for pluggable execution strategies (e.g., `grouped_mm`), and `enable_expert_parallel` for distributing experts across multiple devices.
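The motivation for grouping-based execution strategies can be shown without any library. The sketch below compares calling an expert once per token with grouping tokens by expert and calling each expert once per group. It is an illustration of the general idea under top-1 routing, not the implementation of the `eager`, `batched_mm`, or `grouped_mm` backends, and every name in it is hypothetical.

```python
from collections import defaultdict

def make_expert(scale, call_log):
    """Toy expert: scales every token in a batch and records the batch size."""
    def expert(batch):
        call_log.append(len(batch))
        return [[scale * v for v in tok] for tok in batch]
    return expert

def per_token_dispatch(tokens, assignments, experts):
    """One expert call per token (each call sees a batch of size 1)."""
    return [experts[e]([tok])[0] for tok, e in zip(tokens, assignments)]

def grouped_dispatch(tokens, assignments, experts):
    """Group token positions by expert, run each expert once, scatter results back."""
    positions_by_expert = defaultdict(list)
    for pos, e in enumerate(assignments):
        positions_by_expert[e].append(pos)
    outputs = [None] * len(tokens)
    for e, positions in positions_by_expert.items():
        results = experts[e]([tokens[p] for p in positions])
        for pos, result in zip(positions, results):
            outputs[pos] = result
    return outputs

tokens = [[1.0], [2.0], [3.0], [4.0]]
assignments = [0, 1, 0, 1]  # top-1 expert index chosen for each token

naive_calls, grouped_calls = [], []
naive = per_token_dispatch(
    tokens, assignments,
    [make_expert(10.0, naive_calls), make_expert(100.0, naive_calls)],
)
grouped = grouped_dispatch(
    tokens, assignments,
    [make_expert(10.0, grouped_calls), make_expert(100.0, grouped_calls)],
)
```

Both paths produce the same outputs, but the grouped path makes two expert calls instead of four. On accelerators, fewer and larger matrix multiplications generally use the hardware better than many small ones.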

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.