Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Amazon Stores Foundation AI researchers introduce "expert upcycling," a novel method for progressively expanding Mixture-of-Experts (MoE) model capacity during continued pre-training (CPT) by increasing the number of experts. This technique constructs an $mE$-expert model from an existing $E$-expert model through expert duplication and router extension, crucially preserving per-token inference cost by holding top-$K$ routing fixed. The duplication provides a warm initialization, allowing the expanded model to start with substantially lower loss than random initialization. Subsequent CPT then drives specialization among duplicated experts. In 7B-to-13B total parameter experiments, the upcycled model matches fixed-size baselines on validation loss and 11 downstream benchmarks while saving approximately 32% of GPU hours. When starting from an existing MoE checkpoint, savings can reach around 67%. The method includes "utility-based expert selection," which uses gradient-based importance scores to guide non-uniform duplication, tripling gap closure when CPT is limited.

Key takeaway

For AI engineers and research scientists who need to scale Mixture-of-Experts models efficiently, expert upcycling offers a principled alternative to training the larger model from scratch. You can expand an existing MoE model's capacity by duplicating experts and continuing pre-training; in the 7B-to-13B experiments this saved about 32% of GPU hours while matching fixed-size baselines on validation loss and downstream benchmarks. Consider utility-based expert selection to decide which experts to duplicate, especially when the CPT budget is constrained, and budget enough CPT for the duplicated experts to specialize.
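The summary does not say how importance scores are turned into duplication counts, so the following is only a minimal sketch of one way non-uniform duplication could be allocated. It assumes a non-negative utility score per expert is already available (for example, from some gradient-based importance estimate). The function name `allocate_copies` and the largest-remainder rounding rule are illustrative choices, not the paper's procedure.

```python
def allocate_copies(utility, total_experts):
    """Decide how many copies of each expert go into the expanded model.

    utility: non-negative importance score per existing expert (assumed given).
    total_experts: target expert count after upcycling (m * E in the summary).
    Every expert is kept once; the extra slots are split in proportion to utility.
    """
    n = len(utility)
    if total_experts < n:
        raise ValueError("target expert count must be >= current expert count")

    extra = total_experts - n
    total_u = sum(utility)
    if total_u <= 0:
        # No usable signal: fall back to uniform duplication.
        shares = [extra / n] * n
    else:
        shares = [extra * u / total_u for u in utility]

    # Start from one copy each plus the whole part of every proportional share.
    copies = [1 + int(s) for s in shares]

    # Hand out slots lost to rounding, largest fractional remainder first.
    leftover = total_experts - sum(copies)
    by_remainder = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_remainder[:leftover]:
        copies[i] += 1
    return copies


# Hypothetical scores for a 4-expert layer expanded to 8 experts.
print(allocate_copies([4.0, 2.0, 1.0, 1.0], 8))
```

With equal scores this reduces to uniform duplication, which is the baseline that utility-based selection is meant to improve on when CPT is short.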

Key insights

Expert upcycling efficiently scales MoE models by duplicating experts mid-training, preserving inference cost and leveraging warm initialization.

Method

Pre-train an $E$-expert MoE, apply the upcycling operator to duplicate experts and extend the router into an $mE$-expert model with top-$K$ routing unchanged, then continue pre-training so the duplicated experts specialize.
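The duplication and router-extension step can be sketched for a single MoE layer as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: each expert is reduced to a single weight matrix, the router is a linear map with one logit row per expert, and the names `upcycle_moe`, `route_top_k`, and the optional `noise_std` tie-breaking perturbation are hypothetical.

```python
import numpy as np


def upcycle_moe(expert_weights, router_weight, m, noise_std=0.0, seed=0):
    """Build an (m * E)-expert layer from an E-expert layer by duplication.

    expert_weights: array of shape (E, d_model, d_ff), one matrix per expert.
    router_weight:  array of shape (E, d_model), one logit row per expert.
    noise_std:      optional router perturbation (an assumption, not from the source).
    """
    # Copy every expert m times, laid out as [e0, e0, ..., e1, e1, ...].
    new_experts = np.repeat(expert_weights, m, axis=0)
    # Extend the router so each copy inherits its parent's logit row.
    new_router = np.repeat(router_weight, m, axis=0)
    if noise_std > 0:
        rng = np.random.default_rng(seed)
        new_router = new_router + rng.normal(0.0, noise_std, new_router.shape)
    return new_experts, new_router


def route_top_k(x, router_weight, k):
    """Pick the k highest-scoring experts for one token and softmax their gates."""
    logits = router_weight @ x
    top = np.argsort(-logits, kind="stable")[:k]
    gates = np.exp(logits[top] - logits[top].max())
    return top, gates / gates.sum()


# Toy sizes: E=4 experts expanded by m=2, with top-K routing held at K=2.
rng = np.random.default_rng(0)
experts = rng.normal(size=(4, 8, 16))
router = rng.normal(size=(4, 8))
new_experts, new_router = upcycle_moe(experts, router, m=2)

token = rng.normal(size=8)
chosen, gates = route_top_k(token, new_router, k=2)
print(new_experts.shape, len(chosen))
```

Because `k` is unchanged, each token still passes through the same number of experts after expansion, which is why per-token inference cost is preserved while total parameters grow. Exact copies produce tied router logits, so which copies are selected right after upcycling depends on tie-breaking; continued pre-training is what separates them.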

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.