Post-Trained MoE Can Skip Half Experts via Self-Distillation

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Zero-Expert Self-Distillation Adaptation (ZEDA) is a new framework designed to convert post-trained static Mixture-of-Experts (MoE) language models into more efficient dynamic versions. Unlike existing dynamic MoE methods that require pre-training or task-specific adaptation, ZEDA focuses on practical conversion of already trained models to reduce inference costs. It achieves this by injecting parameter-free zero-output experts into each MoE layer and then adapting the augmented model using a two-stage self-distillation process. The original MoE model serves as a frozen teacher, and a group-level balancing loss is applied. ZEDA was evaluated on Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, including math, code, and instruction following, demonstrating over 50% reduction in expert FLOPs with minimal accuracy loss and achieving approximately 1.20x end-to-end inference speedup.

Key takeaway

For AI engineers and research scientists deploying large MoE models, ZEDA offers a practical way to cut inference cost without pre-training from scratch or task-specific adaptation. It converts an existing static MoE model into a dynamic one, with a reported reduction of over 50% in expert FLOPs and an end-to-end inference speedup of approximately 1.20x on the evaluated models. Consider ZEDA when optimizing deployed MoE architectures for efficiency and lower operating cost.

Key insights

ZEDA efficiently converts static MoE models to dynamic ones, reducing inference costs via self-distillation.

Principles

Post-training adaptation can enhance MoE efficiency.
Self-distillation enables architectural conversion.
Zero-output experts facilitate dynamic sparsity.
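To make the zero-output expert idea concrete, here is a minimal sketch of a single-token MoE layer in plain Python. It assumes a softmax router with top-k selection and no renormalization of the selected weights; those choices, along with the names `route_token` and `moe_layer`, are illustrative assumptions and are not taken from the ZEDA paper. Any router slot with an index past the real experts is treated as a zero-output expert, so selecting it adds nothing to the output and costs no expert computation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, num_real_experts, top_k):
    """Return [(expert_index, weight)] for the real experts this token must run.

    Slots with index >= num_real_experts are hypothetical zero-output experts:
    their output is the zero vector, so they are dropped from the work list.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return [(i, probs[i]) for i in chosen if i < num_real_experts]

def moe_layer(x, router_logits, experts, top_k):
    """Weighted sum of the selected real experts; also report how many ran."""
    out = [0.0] * len(x)
    active = route_token(router_logits, len(experts), top_k)
    for idx, weight in active:
        y = experts[idx](x)  # only real experts are ever evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, len(active)

# Two toy real experts; router slots 2 and 3 are zero-output experts.
experts = [lambda x: [2.0 * v for v in x], lambda x: [v + 1.0 for v in x]]

# Both top-2 slots are zero experts: no expert runs for this token.
easy_out, easy_ran = moe_layer([1.0, 2.0], [1.0, 0.0, 3.0, 2.0], experts, top_k=2)

# Top-2 picks real expert 0 and zero expert 2: only one expert runs.
hard_out, hard_ran = moe_layer([1.0, 2.0], [3.0, 0.0, 2.0, -1.0], experts, top_k=2)
```

In this toy setup the number of real experts evaluated varies per token (`easy_ran` is 0, `hard_ran` is 1) even though top-k is fixed, which is the sense in which zero-output experts turn a static MoE into a dynamic one.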

Method

ZEDA injects parameter-free zero-output experts into MoE layers, then uses two-stage self-distillation with the original MoE as a frozen teacher and a group-level balancing loss for adaptation.
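The summary does not specify the distillation objective, so the following is only a sketch of one common choice: a forward KL divergence between the frozen teacher's and the student's next-token distributions. The function name `distill_kl` and the `temperature` parameter are assumptions for illustration. The two-stage schedule and the group-level balancing loss mentioned above are not reproduced here.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def distill_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token's next-token distribution.

    The teacher is the original static MoE (frozen); the student is the
    same model augmented with zero-output experts. Assumed objective,
    not confirmed by the source.
    """
    t = log_softmax([v / temperature for v in teacher_logits])
    s = log_softmax([v / temperature for v in student_logits])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

teacher = [2.0, 0.5, -1.0]
same_loss = distill_kl(teacher, teacher)            # identical outputs
drift_loss = distill_kl(teacher, [1.0, 1.0, -1.0])  # student has drifted
```

The loss is zero when the student reproduces the teacher exactly and grows as routing tokens to zero-output experts pulls the student's predictions away, which is what gives the adaptation a signal for where skipping experts is safe.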

In practice

Cut expert FLOPs by over 50% with minimal accuracy loss.
Achieve approximately 1.20x end-to-end inference speedup.
Evaluated on Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks.
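A halving of expert FLOPs does not translate into a 2x speedup because experts are only part of total inference time. The back-of-the-envelope sketch below uses Amdahl's law to relate the two reported figures. It assumes latency scales linearly with expert FLOPs, which is an idealization (real MoE inference is often memory-bound or communication-bound), and the function names are hypothetical. The resulting share is an illustration of the arithmetic, not a measurement from the paper.

```python
def end_to_end_speedup(expert_share, flops_kept):
    """Amdahl's law: only the expert share of latency shrinks.

    expert_share: fraction of baseline latency spent in experts (assumed).
    flops_kept:   fraction of expert FLOPs remaining after conversion.
    """
    return 1.0 / ((1.0 - expert_share) + expert_share * flops_kept)

def implied_expert_share(speedup, flops_kept):
    """Invert Amdahl's law to find the expert share consistent with a speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - flops_kept)

# Reported figures: about 1.20x speedup with about 50% of expert FLOPs kept.
share = implied_expert_share(1.20, 0.5)
check = end_to_end_speedup(share, 0.5)
```

Under these assumptions, `share` comes out at about one third, meaning a 1.20x speedup from halving expert compute is what one would expect if experts accounted for roughly a third of baseline latency.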

Topics

Mixture-of-Experts
Self-Distillation
Dynamic MoE
Inference Optimization
Large Language Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.