Why Half the Experts in an MoE Model May Not Be Needed
Summary
The ZEDA paper introduces a method for adapting already post-trained Mixture-of-Experts (MoE) models so that the amount of expert computation varies per token, which significantly reduces inference cost. Standard MoE models spend a fixed expert budget on every token, so computation is wasted on predictable tokens. ZEDA addresses this by adding "zero-output experts" to the routing pool, which lets the router skip normal expert computation without removing any of the original experts. The adapted model is then trained by self-distillation, with the original MoE serving as a frozen teacher. Across Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA replaces 51-53% of expert activations with zero experts, achieving approximately 1.20x inference speedup with only a 0.7-point average accuracy drop. The adaptation takes less than 31 hours for Qwen3-30B-A3B and less than 62 hours for GLM-4.7-Flash on 8 NVIDIA H200 GPUs.
Key takeaway
For MLOps engineers deploying Mixture-of-Experts models, ZEDA offers a practical path to lower inference cost without retraining from scratch. By adapting an existing model to skip more than half of its expert computations, you can get roughly a 1.20x speedup with minimal accuracy loss. Consider evaluating ZEDA on your own MoE workloads to improve resource utilization, especially for applications where token predictability varies.
Key insights
ZEDA dynamically reduces MoE inference cost by teaching post-trained models to skip expert computation for predictable tokens via zero-expert self-distillation.
Principles
- MoE efficiency can be improved post-training.
- Token-level uncertainty guides compute allocation.
- Self-distillation preserves original model behavior.
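To make the self-distillation principle concrete, here is a minimal sketch of a per-token distillation loss in which the adapted model (student) is pulled toward the frozen original model (teacher). The function names and the choice of forward KL divergence are assumptions made for illustration; the summary does not state which divergence or loss weighting ZEDA actually uses.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token's next-token distribution.

    Assumption: forward KL is used here only as a common distillation
    choice; it is not confirmed as ZEDA's objective.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(ti) * (ti - si) for ti, si in zip(t, s))

# The loss is zero when the student reproduces the teacher exactly
# and positive when the two distributions diverge.
same = distill_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distill_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the teacher is frozen, only the student's parameters would receive gradients from a loss like this, which is what keeps the adapted model anchored to the original behavior.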
Method
ZEDA adds zero-output experts to a post-trained MoE's routing pool. It then uses two-stage self-distillation (supervised fine-tuning, on-policy distillation) with a Group Auxiliary Loss to teach the model when to safely skip normal expert computation.
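The routing idea can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the paper's implementation: the names `route_token` and `moe_layer`, the use of softmax followed by top-k selection, and the decision not to renormalize the remaining gate weights are all hypothetical choices made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, num_real_experts, top_k):
    """Select top_k slots from a pool of real plus zero-output experts.

    Indices >= num_real_experts stand for zero-output experts: they add
    nothing to the layer output, so no expert computation runs for them.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    real = [(i, probs[i]) for i in chosen if i < num_real_experts]
    skipped = top_k - len(real)
    return real, skipped

def moe_layer(x, experts, router_logits, top_k):
    """Weighted sum of the selected real experts; zero experts are skipped."""
    real, skipped = route_token(router_logits, len(experts), top_k)
    out = [0.0] * len(x)
    for idx, weight in real:
        y = experts[idx](x)  # only real experts cost compute
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, skipped

# Two hypothetical real experts plus two zero-output slots (indices 2 and 3).
experts = [lambda v: [2.0 * a for a in v], lambda v: [a + 1.0 for a in v]]

# A "predictable" token: the router prefers both zero slots, so nothing runs.
out_easy, skipped_easy = moe_layer([1.0, 2.0], experts, [0.0, 0.0, 5.0, 5.0], top_k=2)

# A harder token: one real expert and one zero slot are selected.
out_hard, skipped_hard = moe_layer([1.0, 2.0], experts, [5.0, -5.0, 0.0, 4.0], top_k=2)
```

The point of the design is that the original experts stay in the pool untouched; the router simply gains extra options that cost nothing to execute, and training teaches it when choosing them is safe.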
In practice
- Adapt existing MoE models for cost savings.
- Prioritize compute for uncertain tokens.
- Use zero experts for structured code/math.
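When estimating savings for your own deployment, a back-of-the-envelope Amdahl-style calculation can help set expectations. The sketch below assumes that skipped expert activations save time in proportion to their share of latency and that routing overhead is negligible; both are simplifying assumptions of this example, not claims from the paper, and the function names are hypothetical.

```python
def estimated_speedup(expert_fraction, skip_ratio):
    """Speedup if expert compute is `expert_fraction` of latency and
    `skip_ratio` of expert activations are replaced by zero experts."""
    return 1.0 / (1.0 - expert_fraction * skip_ratio)

def implied_expert_fraction(speedup, skip_ratio):
    """Invert the formula above: the share of latency that expert
    compute would need to have for the observed speedup."""
    return (1.0 - 1.0 / speedup) / skip_ratio

# Using the reported figures (about 1.20x speedup, roughly 52% skipped),
# this simple model implies expert compute is about 0.32 of latency.
fraction = implied_expert_fraction(1.20, 0.52)

# Plug in a measured expert share for your own workload to get a rough
# projection, for example 0.5 of latency with the same skip ratio.
projected = estimated_speedup(0.5, 0.52)
```

Real speedups depend on batch size, kernel efficiency, and how evenly skipped activations are distributed, so a figure from this model should be checked against profiling before it informs a capacity decision.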
Topics
- Mixture-of-Experts
- Inference Optimization
- Self-Distillation
- Dynamic Routing
- Large Language Models
- Compute Efficiency
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by To Data & Beyond.