GPT-4 Has 1.8 Trillion Parameters. It Uses 2% of Them Per Token.

2026-04-22 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

GPT-4 reportedly uses a Mixture of Experts (MoE) architecture with 1.8 trillion parameters, of which only about 2% are active for any given token. The same approach lets DeepSeek-R1 (671 billion parameters, 37 billion active per token) and Mixtral 8x7B (46.7 billion parameters, running at roughly the speed of a 13B dense model) reach high parameter counts while keeping inference efficient. MoE is projected to be adopted by over 60% of frontier open-source models by April 2026. The article works through the mathematics behind MoE: the gating function, the load-balancing loss, the rationale for Top-2 routing, mechanisms that prevent expert collapse, and the conditions for training stability, including routing entropy requirements. It contrasts this with the FFN layers of a dense transformer, where every input token passes through the same weight matrices W₁ and W₂.
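The active-parameter fractions quoted above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures in the summary; the 13 billion active count for Mixtral is an assumption inferred from the "13B model speed" remark, and the helper name `active_fraction` is hypothetical.

```python
# Parameter counts in billions, taken from the summary above.
# The Mixtral active count (13) is an assumption inferred from "13B model speed".
models = {
    "GPT-4 (reported)": {"total": 1800.0, "active": 0.02 * 1800.0},
    "DeepSeek-R1": {"total": 671.0, "active": 37.0},
    "Mixtral 8x7B": {"total": 46.7, "active": 13.0},
}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for a single token."""
    return active / total


for name, p in models.items():
    frac = active_fraction(p["total"], p["active"])
    print(f"{name}: {p['active']:.0f}B of {p['total']:.1f}B active ({frac:.1%})")
```

Under these figures, 2% of 1.8 trillion is about 36 billion active parameters, DeepSeek-R1 activates roughly 5.5% of its weights per token, and Mixtral roughly 28%.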

Key takeaway

For AI engineers optimizing large language models, the mathematics behind Mixture of Experts (MoE) is worth understanding in detail. Your team should investigate MoE architectures as a way to manage trillion-parameter models efficiently, potentially reducing inference costs and improving performance. Focus on implementing a proper gating function and a load-balancing loss: together they keep training stable and prevent expert collapse, both of which are critical for successful deployment.

Key insights

Mixture of Experts (MoE) enables large models to activate only a fraction of parameters per token, boosting efficiency.

Principles

MoE improves efficiency through sparse activation.
Load-balancing loss prevents expert collapse.
Routing entropy is key for MoE training stability.
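Routing entropy can be made concrete as the Shannon entropy of the router's probability distribution over experts. The sketch below is an illustration under that assumption; the summary does not state the article's exact definition or any threshold, and the function name `routing_entropy` and the example distributions are hypothetical.

```python
import math


def routing_entropy(probs):
    """Shannon entropy (in nats) of one token's router distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


num_experts = 8

# Router spreads probability evenly: entropy reaches its maximum, log(num_experts).
uniform = [1.0 / num_experts] * num_experts

# Router has nearly collapsed onto expert 0: entropy falls toward zero.
collapsed = [0.93] + [0.01] * (num_experts - 1)

print(routing_entropy(uniform), math.log(num_experts))
print(routing_entropy(collapsed))
```

Tracking this value averaged over a batch is one way to notice a router drifting toward collapse during training.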

Method

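MoE uses a gating function to route each token to a subset of expert networks, typically the two highest-scoring experts (Top-2). A load-balancing loss term added during training distributes the workload across experts and prevents expert collapse, where the router sends most tokens to only a few experts.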
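The routing step described above can be sketched in plain Python. This is one common Top-2 variant (softmax over router logits, keep the two largest, renormalize their weights), not necessarily the exact formulation in the article; the toy experts, the example logits, and the names `top2_route` and `moe_forward` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top2_route(router_logits):
    """Return two (expert_index, weight) pairs for one token; weights sum to 1."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts: each one scales its input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

token = [1.0, -1.0]
logits = [0.1, 2.0, 0.3, 2.0]  # in a real model these come from a learned router

print(top2_route(logits))
print(moe_forward(token, logits, experts))
```

Only two of the four experts are evaluated for this token, which is where the compute saving over a dense FFN comes from.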

In practice

Implement Top-2 routing for MoE.
Incorporate load-balancing loss in MoE training.
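One widely used form of the load-balancing term is the auxiliary loss from the Switch Transformer, N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ is the mean router probability for expert i. The sketch below assumes that formulation with top-1 dispatch counts for simplicity; the summary does not confirm it is the article's exact loss, and the helper `peaked_probs` and the example batches are hypothetical.

```python
def load_balancing_loss(router_probs):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: per-token probability lists, shape (tokens x experts).
    f_i: fraction of tokens whose highest-probability expert is i.
    P_i: mean router probability assigned to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts
    P = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        f[top1] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            P[i] += p / num_tokens
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))


def peaked_probs(winner, n=4, high=0.7):
    """Hypothetical router output that favors one expert."""
    low = (1.0 - high) / (n - 1)
    return [high if i == winner else low for i in range(n)]


balanced = [peaked_probs(t % 4) for t in range(8)]   # tokens spread over all experts
collapsed = [peaked_probs(0) for _ in range(8)]      # every token prefers expert 0

print(load_balancing_loss(balanced))
print(load_balancing_loss(collapsed))
```

The loss sits at its minimum of about 1.0 when tokens are spread evenly and grows as routing concentrates on one expert. In training it would be scaled by a small coefficient and added to the main loss, with gradients flowing through the Pᵢ term since the dispatch counts are not differentiable.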

Topics

GPT-4
Mixture-of-Experts
Transformer Architecture
Model Parameters
Gating Function

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.