GPT-4 Has 1.8 Trillion Parameters. It Uses 2% of Them Per Token.
Summary
GPT-4 reportedly uses a Mixture of Experts (MoE) architecture with 1.8 trillion parameters, of which only about 2% are active for any given token. The same approach lets models such as DeepSeek-R1 (671 billion parameters, 37 billion active per token) and Mixtral 8x7B (46.7 billion parameters, running at roughly the speed of a 13B model) reach high parameter counts while keeping inference efficient. According to the article, MoE is projected to be adopted by over 60% of frontier open-source models by April 2026. The article works through the mathematics of MoE: the gating function, the load-balancing loss, the rationale for Top-2 routing, mechanisms that prevent expert collapse, and the conditions for training stability, including routing entropy requirements. It contrasts this with the FFN layers of a dense transformer, where every input token passes through the same weight matrices W₁ and W₂.
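To make the "fraction of parameters per token" figures concrete, the short sketch below redoes the arithmetic using only the numbers quoted in the summary. The function name `active_fraction` is illustrative, and the GPT-4 figures are reported rather than confirmed.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a model's parameters that take part in one token's forward pass."""
    return active_params / total_params


# DeepSeek-R1: 671B total, 37B active per token (figures from the summary).
deepseek_fraction = active_fraction(671e9, 37e9)

# GPT-4: 1.8T total with roughly 2% active (reported figures, not confirmed).
gpt4_active = 0.02 * 1.8e12

print(f"DeepSeek-R1 active share: {deepseek_fraction:.1%}")        # 5.5%
print(f"GPT-4 active parameters: {gpt4_active / 1e9:.0f}B")        # 36B
```

In a dense model the same calculation always gives 100%, because every token passes through the same W₁ and W₂.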
Key takeaway
For AI engineers optimizing large language models, understanding the mathematics behind Mixture of Experts (MoE) is crucial. Teams should evaluate MoE architectures as a way to run trillion-parameter models efficiently, potentially reducing inference cost and improving performance. Successful deployment depends on a well-designed gating function and a load-balancing loss, which together keep training stable and prevent expert collapse.
Key insights
Mixture of Experts (MoE) enables large models to activate only a fraction of parameters per token, boosting efficiency.
Principles
- MoE improves efficiency through sparse activation.
- Load-balancing loss prevents expert collapse.
- Routing entropy is key for MoE training stability.
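One way to make "routing entropy" measurable is to compute the entropy of the router's softmax distribution over experts for each token and average it. The sketch below does that with NumPy. The summary does not say which entropy definition or threshold the article uses, so the per-token definition here and the names `softmax` and `routing_entropy` are assumptions for illustration.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def routing_entropy(router_logits: np.ndarray) -> float:
    """Mean per-token entropy (in nats) of the router distribution over experts."""
    probs = softmax(router_logits)
    per_token = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(per_token.mean())


num_experts = 8

# A router with no preference: every expert gets probability 1/8.
uniform_logits = np.zeros((4, num_experts))

# A collapsed router: every token is sent to expert 0 with near certainty.
collapsed_logits = np.zeros((4, num_experts))
collapsed_logits[:, 0] = 50.0

high = routing_entropy(uniform_logits)    # approaches log(num_experts)
low = routing_entropy(collapsed_logits)   # approaches 0
print(high, low)
```

Entropy near zero across a batch is a sign that the router has stopped spreading tokens, which is the collapse scenario the load-balancing loss is meant to prevent.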
Method
MoE uses a gating function to route each token to a small subset of expert networks, typically the two highest-scoring experts (Top-2). A load-balancing loss term is added to the training objective to spread tokens across experts and prevent expert collapse.
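The routing step can be sketched as a small NumPy forward pass. This is a minimal illustration, not the implementation of any particular model: it assumes a linear gate, a softmax taken over the two selected logits only, and ReLU experts with weights named after the W₁ and W₂ of the dense FFN. The function name `top2_moe_forward` and all shapes are hypothetical.

```python
import numpy as np


def top2_moe_forward(x, w_gate, experts_w1, experts_w2):
    """
    Sparse MoE feed-forward layer with Top-2 routing.

    x:          (tokens, d_model)
    w_gate:     (d_model, num_experts)
    experts_w1: (num_experts, d_model, d_ff)
    experts_w2: (num_experts, d_ff, d_model)
    """
    num_experts = w_gate.shape[1]
    logits = x @ w_gate                                    # (tokens, num_experts)

    # Indices of the two largest gate logits for each token.
    top2 = np.argsort(logits, axis=-1)[:, -2:]             # (tokens, 2)
    top2_logits = np.take_along_axis(logits, top2, axis=-1)

    # Softmax over the two selected logits (an assumption; variants exist).
    top2_logits = top2_logits - top2_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top2_logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for slot in range(2):
        for e in range(num_experts):
            mask = top2[:, slot] == e                      # tokens sent to expert e
            if mask.any():
                hidden = np.maximum(x[mask] @ experts_w1[e], 0.0)   # ReLU
                expert_out = hidden @ experts_w2[e]
                out[mask] += weights[mask, slot][:, None] * expert_out
    return out, top2, weights


rng = np.random.default_rng(0)
tokens, d_model, d_ff, num_experts = 5, 16, 32, 8
x = rng.normal(size=(tokens, d_model))
w_gate = rng.normal(size=(d_model, num_experts))
experts_w1 = rng.normal(size=(num_experts, d_model, d_ff)) * 0.1
experts_w2 = rng.normal(size=(num_experts, d_ff, d_model)) * 0.1

out, top2, weights = top2_moe_forward(x, w_gate, experts_w1, experts_w2)
print(out.shape, top2.shape)
```

Each token touches only two of the eight expert weight sets, which is where the gap between total and active parameter counts comes from.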
In practice
- Implement Top-2 routing for MoE.
- Incorporate load-balancing loss in MoE training.
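For the load-balancing term, one common formulation is the auxiliary loss used in Switch Transformer style routers: the number of experts times the sum, over experts, of the fraction of routing slots an expert receives multiplied by its mean router probability. The summary does not state which formulation the article derives, so treat the sketch below as one plausible choice; the function name and the coefficient `alpha` are hypothetical.

```python
import numpy as np


def load_balancing_loss(router_probs, expert_assignment, num_experts):
    """
    Auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs:      (tokens, num_experts) softmax output of the gate
    expert_assignment: (tokens, k) indices of the experts chosen per token
    """
    # f_i: fraction of routing slots dispatched to expert i.
    counts = np.bincount(expert_assignment.ravel(), minlength=num_experts)
    f = counts / expert_assignment.size
    # P_i: mean router probability assigned to expert i.
    p = router_probs.mean(axis=0)
    return float(num_experts * np.sum(f * p))


num_experts, tokens = 8, 8

# Balanced case: uniform probabilities, every expert receives two slots.
balanced_probs = np.full((tokens, num_experts), 1.0 / num_experts)
balanced_assign = np.stack(
    [np.arange(tokens), (np.arange(tokens) + 1) % num_experts], axis=1
)
balanced = load_balancing_loss(balanced_probs, balanced_assign, num_experts)

# Collapsed case: every token goes to experts 0 and 1 only.
collapsed_probs = np.zeros((tokens, num_experts))
collapsed_probs[:, :2] = 0.5
collapsed_assign = np.tile(np.array([0, 1]), (tokens, 1))
collapsed = load_balancing_loss(collapsed_probs, collapsed_assign, num_experts)

print(balanced, collapsed)   # 1.0 4.0

# Hypothetical use in training: total_loss = task_loss + alpha * aux_loss,
# where alpha is a small coefficient chosen by the practitioner.
alpha = 0.01
```

The loss reaches its minimum of 1.0 when tokens and probability mass are spread evenly, and grows as routing concentrates on fewer experts.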
Topics
- GPT-4
- Mixture-of-Experts
- Transformer Architecture
- Model Parameters
- Gating Function
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.