Is Retraining-Free Enough? The Necessity of Router Calibration for Efficient MoE Compression

· Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Mixture-of-Experts (MoE) models, while efficient for scaling capacity, face significant memory bottlenecks during deployment due to their large parameter footprint. This work categorizes retraining-free MoE compression into three paradigms: Expert Pruning, Expert Editing, and Expert Merging. It identifies "router-expert mismatch" as the primary cause of persistent performance degradation after compression, arguing that current retraining-free methods neglect to update the router when experts are modified. To address this, the authors propose Router Knowledge Distillation (Router KD), a lightweight method that updates only the router parameters by distilling the original model's next-token distribution on unlabeled calibration data. Experiments on Qwen3-30B-A3B-Instruct-2507 and Mixtral-8x7B-Instruct-v0.1 demonstrate that Router KD consistently recovers performance across all three compression paradigms, with significantly larger gains observed in fine-grained MoEs like Qwen3 due to their more complex routing decision boundaries.

Key takeaway

For NLP engineers and AI scientists deploying compressed MoE models, applying Router Knowledge Distillation (Router KD) after compression helps recover the performance lost to router-expert mismatch. Treat Router KD as a standard post-compression step, especially for fine-grained MoE architectures like Qwen3, where the reported gains are largest. Because it updates only the router parameters, the computational overhead is small and no full-model retraining is needed.

Key insights

Router calibration is essential for effective retraining-free MoE compression, mitigating performance degradation from router-expert mismatch.

Principles

Method

Router Knowledge Distillation (Router KD) updates only the router parameters, leaving the compressed experts frozen. It trains the router to match the original model's next-token distribution on unlabeled calibration data, which keeps the computational overhead low.
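The idea can be illustrated with a toy sketch in pure Python. Everything here is an assumption made for illustration and is not the authors' implementation: the tiny dimensions, the soft (non-top-k) routing, the choice of KL(teacher || student) as the distillation loss, the finite-difference gradient, and all names (`moe_probs`, `kd_loss`, `router_kd`). It prunes one expert from a three-expert toy layer, then tunes only the router rows of the remaining experts so that the compressed model's next-token distribution moves back toward the original's.

```python
import copy
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def moe_probs(x, router, experts):
    # Toy MoE layer with soft routing (assumption: real MoEs route top-k).
    gate = softmax(matvec(router, x))
    outs = [matvec(e, x) for e in experts]
    mixed = [sum(g * o[v] for g, o in zip(gate, outs)) for v in range(len(outs[0]))]
    return softmax(mixed)  # next-token distribution over a tiny vocabulary

def kd_loss(router, experts, calib, teacher_probs):
    # Mean KL(teacher || student) over unlabeled calibration inputs.
    total = 0.0
    for x, p in zip(calib, teacher_probs):
        q = moe_probs(x, router, experts)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(calib)

def router_kd(router, experts, calib, teacher_probs, steps=100, lr=0.5, eps=1e-5):
    # Only the router is updated; `experts` is read but never modified.
    w = [row[:] for row in router]
    best = kd_loss(w, experts, calib, teacher_probs)
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in w]
        for i in range(len(w)):
            for j in range(len(w[i])):
                orig = w[i][j]
                w[i][j] = orig + eps
                up = kd_loss(w, experts, calib, teacher_probs)
                w[i][j] = orig - eps
                down = kd_loss(w, experts, calib, teacher_probs)
                w[i][j] = orig
                grad[i][j] = (up - down) / (2 * eps)  # central difference
        cand = [[wij - lr * g for wij, g in zip(row, grow)]
                for row, grow in zip(w, grad)]
        cand_loss = kd_loss(cand, experts, calib, teacher_probs)
        if cand_loss < best:   # accept only steps that reduce the loss
            w, best = cand, cand_loss
        else:                  # otherwise shrink the step size
            lr *= 0.5
    return w, best

rng = random.Random(0)
DIM, VOCAB, N_EXPERTS = 2, 4, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# "Original" model: router plus three experts (hypothetical toy weights).
teacher_router = rand_matrix(N_EXPERTS, DIM)
teacher_experts = [rand_matrix(VOCAB, DIM) for _ in range(N_EXPERTS)]

# Unlabeled calibration inputs and the original model's distributions on them.
calib = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(32)]
teacher_probs = [moe_probs(x, teacher_router, teacher_experts) for x in calib]

# Compression by expert pruning: drop the last expert and its router row.
student_experts = copy.deepcopy(teacher_experts[:2])
student_router = [row[:] for row in teacher_router[:2]]
experts_snapshot = copy.deepcopy(student_experts)

loss_before = kd_loss(student_router, student_experts, calib, teacher_probs)
tuned_router, loss_after = router_kd(student_router, student_experts,
                                     calib, teacher_probs)
print(f"KD loss before: {loss_before:.6f}  after: {loss_after:.6f}")
```

In a real setting the router would be trained with automatic differentiation on the full model rather than finite differences; the sketch only shows the division of labor, with experts frozen and the router as the sole trainable component.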

Topics

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.