Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Summary
Amazon Stores Foundation AI researchers introduce "expert upcycling," a novel method for progressively expanding Mixture-of-Experts (MoE) model capacity during continued pre-training (CPT) by increasing the number of experts. This technique constructs an $mE$-expert model from an existing $E$-expert model through expert duplication and router extension, crucially preserving per-token inference cost by holding top-$K$ routing fixed. The duplication provides a warm initialization, allowing the expanded model to start with substantially lower loss than random initialization. Subsequent CPT then drives specialization among duplicated experts. In 7B-to-13B total parameter experiments, the upcycled model matches fixed-size baselines on validation loss and 11 downstream benchmarks while saving approximately 32% of GPU hours. When starting from an existing MoE checkpoint, savings can reach around 67%. The method includes "utility-based expert selection," which uses gradient-based importance scores to guide non-uniform duplication, tripling gap closure when CPT is limited.
Key takeaway
For AI engineers and research scientists who need to scale Mixture-of-Experts models efficiently, expert upcycling offers a principled alternative to training a larger model from scratch. You can expand an existing MoE model's capacity by duplicating experts and continuing pre-training, saving significant GPU hours (roughly 32% in the 7B-to-13B total parameter experiments) while matching the quality of fixed-size baselines. Consider utility-based expert selection to improve the initialization of new experts, especially when the CPT budget is constrained, and budget enough CPT for the duplicated experts to specialize.
Key insights
Expert upcycling efficiently scales MoE models by duplicating experts mid-training, preserving inference cost and leveraging warm initialization.
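A short worked example may help show why holding top-$K$ fixed keeps per-token expert compute constant. The symbol $P$ (parameters per expert) and the numbers below are illustrative assumptions, not values from the paper.

$$
\text{total expert parameters: } E \cdot P \;\longrightarrow\; mE \cdot P,
\qquad
\text{active expert parameters per token: } K \cdot P \;\longrightarrow\; K \cdot P .
$$

For instance, with $E = 8$, $m = 2$, and $K = 2$, a layer grows from 8 to 16 experts while each token is still processed by 2 of them. The router does score $mE$ experts instead of $E$, but that is one small linear projection per layer, so the extra cost is minor relative to the expert computation.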
Principles
- Increasing expert count at fixed top-$K$ improves MoE quality.
- Warm initialization significantly reduces training loss post-expansion.
- Gradient-based utility scores guide effective expert duplication.
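One way to turn per-expert utility scores into a non-uniform duplication plan is sketched below. This summary does not give the paper's scoring formula or allocation rule, so the function name `allocate_copies`, the proportional largest-remainder scheme, and the example scores are all assumptions made for illustration. The scores are taken as given, for example an importance value computed from gradients elsewhere.

```python
def allocate_copies(scores, target_total):
    """Split target_total expert slots across experts, favoring high scores.

    Hypothetical allocation rule (not taken from the paper): every original
    expert keeps one slot, and the remaining slots are shared out in
    proportion to the scores using largest-remainder rounding.
    """
    n = len(scores)
    if target_total < n:
        raise ValueError("target_total must be at least the number of experts")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")

    extra = target_total - n
    total = sum(scores)
    if total == 0:
        # No signal: fall back to a uniform split of the extra slots.
        shares = [extra / n] * n
    else:
        shares = [extra * s / total for s in scores]

    # Whole-number part of each share, plus the slot every expert keeps.
    copies = [1 + int(share) for share in shares]

    # Hand out the slots lost to rounding, largest fractional part first.
    leftover = target_total - sum(copies)
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        copies[i] += 1
    return copies


# Toy example: 4 experts expanded to 8 slots (m = 2 on average).
scores = [0.5, 0.25, 0.125, 0.125]
copies = allocate_copies(scores, target_total=8)
# copies == [3, 2, 2, 1]: the highest-utility expert gets the most duplicates.
```

Uniform duplication is the special case where every expert receives exactly $m$ copies. A plan like the one above only changes which experts are copied more often, not the final expert count.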
Method
Pre-train an $E$-expert MoE, then apply an upcycling operator that duplicates experts and extends the router to produce an $mE$-expert model, followed by continued pre-training to specialize the new experts.
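The duplication and router-extension step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `upcycle` and `top_k`, the list-based data layout, the choice to copy each router row alongside its expert, and the optional `noise_std` symmetry-breaking term are all hypothetical. The summary does not say how the new router rows are initialized.

```python
import copy
import random


def upcycle(experts, router_rows, m, noise_std=0.0, seed=0):
    """Build an m*E-expert layer from an E-expert layer by duplication.

    experts: list of E expert parameter objects (any deep-copyable structure).
    router_rows: list of E lists of floats, one router weight row per expert.
    noise_std: hypothetical symmetry-breaking noise added to copied router
        rows; 0.0 gives exact copies.
    Copy j of original expert i is stored at index j * E + i.
    """
    if m < 1:
        raise ValueError("m must be >= 1")
    rng = random.Random(seed)
    new_experts, new_router = [], []
    for _ in range(m):
        for expert, row in zip(experts, router_rows):
            # Warm initialization: each new expert starts from trained weights.
            new_experts.append(copy.deepcopy(expert))
            new_router.append([w + rng.gauss(0.0, noise_std) for w in row])
    return new_experts, new_router


def top_k(logits, k):
    """Indices of the k largest router logits; k is unchanged by upcycling."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


# Toy layer with E = 2 experts, expanded with m = 2.
experts = [{"w": [[1.0, 0.0], [0.0, 1.0]]}, {"w": [[2.0, 0.0], [0.0, 2.0]]}]
router = [[0.5, -0.5], [-0.5, 0.5]]
big_experts, big_router = upcycle(experts, router, m=2)

# Routing still selects K experts per token, now out of 4 instead of 2.
selected = top_k([0.1, 0.9, 0.3, 0.7], k=2)
```

With exact copies, duplicated experts receive identical router logits, so they only diverge once continued pre-training (or a small `noise_std`) breaks the tie. That is consistent with the summary's point that CPT is what drives specialization.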
In practice
- Use expert upcycling to expand MoE capacity without full retraining.
- Prioritize utility-based expert selection for better initialization.
- Allocate at least 50% CPT for strong quality gap closure.
Topics
- Mixture-of-Experts
- Expert Upcycling
- Large Language Models
- Compute Efficiency
- Continued Pre-training
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.