Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Summary
Manifold Power Iteration (MPI) is an approach to redesigning the routers in Mixture-of-Experts (MoE) models. A router decides which subset of experts is activated for each token, yet router design has traditionally lacked a principle for encoding each expert matrix into a representative vector that accurately reflects token-expert affinity. MPI addresses this by aligning each router row with the principal singular direction of its associated expert, the single direction that best summarizes a matrix (it defines the best rank-1 approximation). The method follows a "Power-then-Retract" paradigm: a power iteration step is applied to the router weights, followed by a retraction that enforces a norm constraint, which the authors credit for both efficiency and stability. Theoretical analysis indicates that MPI drives router rows to converge towards these principal singular directions. Pretraining experiments on MoE models ranging from 1B to 11B parameters indicate that this alignment yields more effective models. The paper was published on 2026-06-10.
Key takeaway
Machine Learning Engineers designing or optimizing Mixture-of-Experts models should consider integrating Manifold Power Iteration (MPI) into their router design. By aligning router rows with expert principal singular directions, the method is reported to improve MoE model effectiveness at scales from 1B to 11B parameters. Its "Power-then-Retract" paradigm is designed to keep the router update efficient and stable, which may carry over to large-scale MoE deployments.
Key insights
Manifold Power Iteration (MPI) aligns MoE router rows with expert principal singular directions for improved token-expert affinity.
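To make "token-expert affinity" concrete, here is a minimal sketch of a common top-k routing formulation, assuming the affinity is the dot product between a token's hidden state and a router row. The function name `route_top_k`, the shapes, and the softmax over the selected logits are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def route_top_k(tokens, router, k=2):
    """Pick k experts per token from dot-product affinities (illustrative).

    tokens: (n_tokens, d_model) hidden states
    router: (n_experts, d_model), one row per expert
    """
    # Affinity of every token to every expert: one dot product per pair.
    logits = tokens @ router.T                        # (n_tokens, n_experts)
    # Indices of the k largest affinities per token, best first.
    top_idx = np.argsort(-logits, axis=1)[:, :k]
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    # Softmax over the selected experts only (shifted for numerical stability).
    shifted = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return top_idx, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))   # 5 tokens, d_model = 8 (toy sizes)
router = rng.standard_normal((4, 8))   # 4 experts
expert_ids, gate_weights = route_top_k(tokens, router, k=2)
```

Under this formulation, the direction of each router row determines which tokens an expert attracts, which is why the choice of that direction matters.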
Principles
- Align router rows with expert principal singular directions.
- The principal singular direction is treated as the most expressive single-vector description of an expert matrix.
- Impose a norm constraint on router weights (via retraction) for efficiency and stability.
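The second principle rests on a standard linear-algebra fact that can be checked numerically. The sketch below uses a random toy matrix `W` as a stand-in for an expert matrix (an assumption for illustration only) and verifies that the top right singular vector is the unit direction with the largest gain, and that it defines the best rank-1 approximation.

```python
import numpy as np

def rank1_summary(matrix):
    """Return (sigma_1, u_1, v_1): the leading singular triplet."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return s[0], u[:, 0], vt[0]

def rank1_error(matrix):
    """Frobenius error of the rank-1 approximation sigma_1 * u_1 v_1^T."""
    sigma1, u1, v1 = rank1_summary(matrix)
    return np.linalg.norm(matrix - sigma1 * np.outer(u1, v1))

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))        # toy stand-in for an expert matrix
sigma1, u1, v1 = rank1_summary(W)

# Gain along the principal direction equals the largest singular value.
gain_principal = np.linalg.norm(W @ v1)

# No other unit direction is amplified more than v1.
directions = rng.standard_normal((1000, 4))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
gains_random = np.linalg.norm(W @ directions.T, axis=0)
```

The rank-1 error equals the root sum of squares of the remaining singular values, so the more dominant the leading singular value, the more of the matrix a single vector captures.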
Method
MPI uses a "Power-then-Retract" paradigm: perform power iteration on router weights, then retract to impose a norm constraint.
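The following is a minimal sketch of one "Power-then-Retract" update, not the paper's implementation. It assumes the power step multiplies a router row by W^T W, where W is the expert matrix, and that the retraction rescales the result to a fixed norm. The names `power_then_retract`, `make_expert`, and `radius` are hypothetical. Which expert matrix is used and how this step interleaves with gradient updates are not stated in this summary.

```python
import numpy as np

def power_then_retract(router_row, expert_matrix, radius=1.0):
    """One assumed MPI-style update for a single router row."""
    # Power step: one power iteration on W^T W, which amplifies the
    # component along the principal right singular direction the most.
    powered = expert_matrix.T @ (expert_matrix @ router_row)
    norm = np.linalg.norm(powered)
    if norm == 0.0:
        return router_row              # degenerate case: leave row unchanged
    # Retraction: rescale back onto the sphere of the given radius.
    return radius * powered / norm

def make_expert(rng, d_hidden, d_model):
    """Toy expert matrix with a known spectrum (largest singular value 3)."""
    u, _ = np.linalg.qr(rng.standard_normal((d_hidden, d_model)))
    v, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
    singular_values = np.concatenate(([3.0], np.linspace(1.0, 0.1, d_model - 1)))
    # Columns of v are the right singular vectors; v[:, 0] is the principal one.
    return u @ np.diag(singular_values) @ v.T, v[:, 0]

def alignment_after(steps, seed=0, d_hidden=32, d_model=8):
    """|cosine| between the router row and the principal direction."""
    rng = np.random.default_rng(seed)
    W, v1 = make_expert(rng, d_hidden, d_model)
    r = rng.standard_normal(d_model)
    r /= np.linalg.norm(r)
    for _ in range(steps):
        r = power_then_retract(r, W)
    return abs(float(r @ v1))
```

Calling `alignment_after(0)` and `alignment_after(50)` contrasts a random row with one that has been repeatedly powered and retracted. In this toy setup the gap between the top two singular values is large, so alignment is fast; convergence slows as that gap shrinks.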
In practice
- Apply MPI to pretrain MoE models.
- Reported gains in effectiveness span MoE models from 1B to 11B parameters.
Topics
- Mixture-of-Experts
- MoE Routers
- Manifold Power Iteration
- Singular Directions
- Model Pretraining
- Large-scale Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.