Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression
Summary
Mixture-of-Experts (MoE) models scale capacity efficiently, but their large parameter footprint creates a significant memory bottleneck at deployment. This work groups retraining-free MoE compression into three paradigms: Expert Pruning, Expert Editing, and Expert Merging. It identifies "router-expert mismatch" as the primary cause of the performance degradation that persists after compression: current retraining-free methods modify the experts but leave the router unchanged. To address this, the authors propose Router Knowledge Distillation (Router KD), a lightweight method that updates only the router parameters by distilling the original model's next-token distribution on unlabeled calibration data. Experiments on Qwen3-30B-A3B-Instruct-2507 and Mixtral-8x7B-Instruct-v0.1 show that Router KD consistently recovers performance across all three paradigms, with significantly larger gains in fine-grained MoEs such as Qwen3, which the authors attribute to their more complex routing decision boundaries.
Key takeaway
For NLP engineers and AI scientists deploying compressed MoE models, Router Knowledge Distillation (Router KD) recovers performance lost to router-expert mismatch. Treat it as a standard post-compression step, especially for fine-grained MoE architectures such as Qwen3: it offers substantial performance recovery at minimal computational overhead because only the router is updated, and it allows efficient deployment without full model retraining.
Key insights
Router calibration is essential for effective retraining-free MoE compression, mitigating performance degradation from router-expert mismatch.
Principles
- Expert compression necessitates router calibration.
- Fine-grained MoEs benefit more from router calibration.
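One rough way to see why fine-grained MoEs have more room for routing error is to count how many distinct expert subsets a top-k router can pick per token. The sketch below is only a proxy for "routing complexity", not the measure used in the paper, and the expert counts (8 experts with 2 active, 128 experts with 8 active) are assumptions taken from the public model configurations rather than from this summary.

```python
import math

def routing_choices(num_experts, top_k):
    """Number of distinct expert subsets a top-k router can select for one token."""
    return math.comb(num_experts, top_k)

# Expert counts are assumptions from public model configurations,
# not figures stated in this summary.
coarse = routing_choices(8, 2)    # Mixtral-style layer: 8 experts, 2 active
fine = routing_choices(128, 8)    # Qwen3-30B-A3B-style layer: 128 experts, 8 active

print("coarse-grained choices per token:", coarse)
print("fine-grained choices per token:", fine)
print("ratio:", fine // coarse)
```

The fine-grained layer has many orders of magnitude more possible selections per token, so a router left untouched after compression has far more ways to pick a subset that no longer suits the modified experts.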
Method
Router Knowledge Distillation (Router KD) updates only the router parameters, distilling the original model's next-token distribution on unlabeled calibration data. Because the experts and all other weights stay frozen, the computational overhead is small.
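The idea can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a forward KL objective, KL(teacher || student), uses dense softmax gating instead of top-k routing, and estimates gradients by finite differences so that it runs on the standard library alone. All names and numbers (`moe_probs`, `router_kd`, the two tiny experts, the calibration inputs) are hypothetical.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_probs(router, experts, x):
    """Next-token distribution of a toy MoE layer with dense softmax gating."""
    gate = softmax([sum(w * xi for w, xi in zip(row, x)) for row in router])
    vocab = len(experts[0])
    logits = [sum(g * e[v] for g, e in zip(gate, experts)) for v in range(vocab)]
    return softmax(logits)

def kd_loss(router, experts, teacher_probs, calib):
    """Mean KL(teacher || student) over the calibration inputs (assumed objective)."""
    total = 0.0
    for x, t in zip(calib, teacher_probs):
        s = moe_probs(router, experts, x)
        total += sum(ti * math.log(ti / si) for ti, si in zip(t, s))
    return total / len(calib)

def router_kd(router, experts, teacher_probs, calib, lr=0.1, steps=300, eps=1e-5):
    """Gradient descent on the router weights only; the experts are never modified."""
    router = [row[:] for row in router]  # work on a copy of the original router
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in router]
        for i, row in enumerate(router):
            for j in range(len(row)):
                orig = row[j]
                row[j] = orig + eps
                up = kd_loss(router, experts, teacher_probs, calib)
                row[j] = orig - eps
                down = kd_loss(router, experts, teacher_probs, calib)
                row[j] = orig
                grad[i][j] = (up - down) / (2 * eps)  # central finite difference
        for i, row in enumerate(router):
            for j in range(len(row)):
                row[j] -= lr * grad[i][j]
    return router

# Hypothetical toy model: each expert is a fixed logit vector over 3 tokens.
original_experts = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
compressed_experts = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # expert 1 altered by compression
original_router = [[1.0, -1.0], [-1.0, 1.0]]
calib = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -0.5], [-0.5, 1.0]]  # unlabeled inputs

# The teacher is the original model; the student reuses its router with compressed experts.
teacher_probs = [moe_probs(original_router, original_experts, x) for x in calib]
loss_before = kd_loss(original_router, compressed_experts, teacher_probs, calib)
calibrated_router = router_kd(original_router, compressed_experts, teacher_probs, calib)
loss_after = kd_loss(calibrated_router, compressed_experts, teacher_probs, calib)

print("KD loss before router calibration:", loss_before)
print("KD loss after router calibration:", loss_after)
```

No labels are needed: the teacher's own output distribution is the training signal. In a real model the same loop would use automatic differentiation and the framework's optimizer, with every parameter except the router weights frozen.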
In practice
- Apply Router KD after MoE expert compression.
- Prioritize Router KD for fine-grained MoE architectures.
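Applying the step in practice starts with deciding which parameters stay trainable. The sketch below shows one way to separate router weights from everything else by name. The parameter names and the `ROUTER_PATTERN` regular expression are hypothetical: real names differ between model families and library versions, so inspect the checkpoint before relying on any pattern.

```python
import re

# Hypothetical naming scheme, for illustration only: the router is assumed to be
# stored as "<layer>.mlp.gate.weight". Adjust the pattern to the actual checkpoint.
ROUTER_PATTERN = re.compile(r"\.gate\.weight$")

def split_trainable(param_names, router_pattern=ROUTER_PATTERN):
    """Return (trainable, frozen) name lists; only router weights are trainable."""
    trainable = [n for n in param_names if router_pattern.search(n)]
    frozen = [n for n in param_names if not router_pattern.search(n)]
    return trainable, frozen

example_names = [
    "layers.0.mlp.gate.weight",                 # router
    "layers.0.mlp.experts.0.gate_proj.weight",  # expert weight, not the router
    "layers.0.mlp.experts.0.up_proj.weight",
    "layers.0.mlp.experts.0.down_proj.weight",
    "layers.0.self_attn.q_proj.weight",
]

trainable, frozen = split_trainable(example_names)
print("trainable:", trainable)
print("frozen:", frozen)
```

A plain substring match on "gate" would also select the experts' `gate_proj` weights in this naming scheme, which defeats the purpose of a router-only update. Anchoring the pattern avoids that.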
Topics
- Mixture-of-Experts
- MoE Compression
- Router Calibration
- Knowledge Distillation
- Large Language Models
Best for: NLP Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.