Why Half the Experts in an MoE Model May Not Be Needed

2024-06-18 · Source: To Data & Beyond · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

The ZEDA paper introduces a method for converting an already post-trained Mixture-of-Experts (MoE) model into a dynamic-compute variant, reducing inference cost without retraining from scratch. Standard MoE models activate a fixed number of experts for every token, which wastes computation on predictable tokens. ZEDA adds "zero-output experts" to the routing pool: when the router selects one, that slot skips normal expert computation, and none of the original experts are removed. The adapted model is then trained by self-distillation, with the original MoE serving as a frozen teacher. On Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA replaces 51-53% of expert activations with zero experts, yielding roughly a 1.20x inference speedup at an average accuracy drop of only 0.7 points. Adaptation takes less than 31 hours for Qwen3-30B-A3B and less than 62 hours for GLM-4.7-Flash on 8 NVIDIA H200 GPUs.
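The routing change can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the layout (zero experts appended after the real ones), and the use of raw softmax probabilities as mixing weights without renormalization are all assumptions made for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, num_real_experts, top_k):
    """Pick top_k slots from a pool of real + zero-output experts.

    Assumption: indices >= num_real_experts are zero-output experts.
    Returns the selected real experts as (index, weight) pairs and the
    number of slots that went to zero experts (i.e. skipped compute).
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    real = [(i, probs[i]) for i in ranked if i < num_real_experts]
    return real, top_k - len(real)

def moe_layer(x, router_logits, experts, top_k):
    """Run only the selected real experts; zero experts add nothing."""
    real, skipped = route_token(router_logits, len(experts), top_k)
    out = [0.0] * len(x)
    for idx, weight in real:
        y = experts[idx](x)  # the only place expert compute is spent
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, skipped

# Toy setup: 4 real experts (simple scalings) plus 2 zero experts.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.0, -1.0, 3.0, 0.5]  # index 4 is a zero expert
out, skipped = moe_layer([1.0, 1.0], logits, experts, top_k=2)
print(skipped)  # 1 of the 2 slots went to a zero expert
```

Because the original experts stay in the pool, a router that never picks a zero expert behaves like the unmodified layer in this sketch; the savings come only from tokens where zero experts win a top-k slot.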

Key takeaway

For MLOps engineers deploying Mixture-of-Experts models, ZEDA offers a practical path to lower inference cost without retraining from scratch. Adapting an existing model so that it skips more than half of its expert activations yields roughly a 1.20x speedup with minimal accuracy loss. Consider evaluating ZEDA on your own MoE workloads, especially those where token predictability varies widely.
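Skipping about half of expert activations does not halve latency, because attention, routing, and memory traffic still cost time. A back-of-the-envelope Amdahl-style estimate makes this concrete. It assumes a zero expert costs nothing and that skipped time scales linearly with skipped activations; the implied expert share of runtime computed here is an inference from the reported numbers under those assumptions, not a figure from the source.

```python
def end_to_end_speedup(expert_time_share, skipped_fraction):
    """Speedup if `skipped_fraction` of expert work disappears.

    expert_time_share: fraction of total inference time spent in experts
    (hypothetical parameter; not reported in the summary).
    """
    remaining = 1.0 - expert_time_share * skipped_fraction
    return 1.0 / remaining

def implied_expert_time_share(speedup, skipped_fraction):
    """Invert the formula above to find the expert share of runtime."""
    return (1.0 - 1.0 / speedup) / skipped_fraction

# Reported: ~1.20x speedup with ~52% of activations replaced.
share = implied_expert_time_share(1.20, 0.52)
print(round(share, 2))  # 0.32
```

Under these simplifying assumptions, expert computation would account for roughly a third of inference time, which is why measuring your own workload's time breakdown matters before estimating savings.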

Key insights

ZEDA dynamically reduces MoE inference cost by teaching post-trained models to skip expert computation for predictable tokens via zero-expert self-distillation.

Principles

MoE efficiency can be improved post-training.
Token-level uncertainty guides compute allocation.
Self-distillation preserves original model behavior.

Method

ZEDA adds zero-output experts to a post-trained MoE's routing pool. It then uses two-stage self-distillation (supervised fine-tuning, on-policy distillation) with a Group Auxiliary Loss to teach the model when to safely skip normal expert computation.
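The self-distillation signal can be sketched as a per-token divergence between the frozen teacher's next-token distribution and the adapted student's. The summary does not state the exact loss, so the forward KL used here is an assumption (a common choice for distillation), and the Group Auxiliary Loss is omitted because it is not defined in this summary.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) over one token's vocabulary distribution.

    teacher_logits come from the frozen original MoE; student_logits
    from the model with zero experts in its routing pool.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

teacher = [2.0, 0.5, -1.0]
same = distill_kl(teacher, teacher)              # student matches teacher
drifted = distill_kl(teacher, [0.0, 0.0, 0.0])   # student is uniform
print(same < drifted)
```

The loss is zero when the student reproduces the teacher exactly and grows as skipping experts changes the output distribution, which is the pressure that teaches the router to skip only where it is safe.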

In practice

Adapt existing MoE models for cost savings.
Prioritize compute for uncertain tokens.
Use zero experts for structured code/math.

Topics

Mixture-of-Experts
Inference Optimization
Self-Distillation
Dynamic Routing
Large Language Models
Compute Efficiency

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by To Data & Beyond.