Boosting MoE Training Throughput with Advanced Fusion Kernels
Summary
NVIDIA has introduced fused Multi-Layer Perceptron (MLP) kernels, written in the CuTe DSL, to raise the training throughput of Mixture-of-Experts (MoE) models. The kernels deliver a 1.3x–2x kernel-level speedup over unfused paths by removing memory and synchronization bottlenecks, and they enable sync-free MoE execution so that a full training iteration can be captured in CUDA graphs. End to end, this translates to an 8% gain in the DeepSeek-V3 pre-training setup and a 93% gain for GPT-OSS pre-training. The kernels address activation bottlenecks, CPU-bound execution, and quantization costs by fusing GroupGemm with activation functions (SwiGLU, GeGLU, sReLU) and with quantization steps. They are available in the NVIDIA cuDNN Frontend, NVIDIA Transformer Engine, and NVIDIA Megatron-Core.
Key takeaway
AI engineers optimizing large-scale Mixture-of-Experts (MoE) training should integrate NVIDIA's fused MLP kernels to reduce training time and improve hardware utilization. Adopting them through cuDNN Frontend, Transformer Engine, or Megatron-Core can yield an end-to-end gain of up to 93% (the GPT-OSS pre-training result), which directly affects project timelines and compute costs. Prioritize updating your software stack to pick up these gains.
Key insights
Fusing MoE block operations into custom kernels boosts training throughput by removing memory-traffic and CPU-synchronization bottlenecks.
Principles
- Fusing operations reduces memory I/O and raises GPU utilization.
- Hardware-aware software co-design is critical for throughput.
- Eliminating CPU synchronization improves GPU utilization.
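The first principle can be made concrete with a back-of-the-envelope traffic count. The sketch below is a simplified model, not a measurement of the NVIDIA kernels: it assumes a gated activation whose GEMM output is twice the hidden width, 2-byte elements, forward pass only, and it ignores the input and weight reads that both paths share. The function name and parameters are hypothetical.

```python
def activation_traffic_bytes(tokens, hidden, bytes_per_elem=2):
    """Estimate global-memory bytes moved around the activation step.

    Simplified model (an assumption, not a profile of any real kernel):
    the GEMM produces gate and up projections, so its output is
    tokens x (2 * hidden); the gated activation reduces it to tokens x hidden.
    """
    gemm_out = tokens * 2 * hidden * bytes_per_elem  # gate + up projections
    act_out = tokens * hidden * bytes_per_elem       # activated result

    # Unfused: GEMM writes its output, a second kernel reads it back,
    # then writes the activated result.
    unfused = gemm_out + gemm_out + act_out

    # Fused: the activation runs on the accumulator before anything is
    # stored, so only the activated result reaches global memory.
    fused = act_out
    return unfused, fused


unfused, fused = activation_traffic_bytes(tokens=4096, hidden=2048)
print(unfused, fused, unfused / fused)
```

Under these assumptions the unfused path moves five times as many bytes for this step regardless of the sizes chosen, which is why memory-bound activation kernels are a natural fusion target.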
Method
The MoE block is redesigned around custom CuTe DSL kernels that fuse GroupGemm with activation functions (SwiGLU, GeGLU, sReLU) and with quantization/transpose steps, yielding sync-free MoE execution.
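To illustrate what fusing GroupGemm with SwiGLU means numerically, here is a minimal pure-Python reference sketch. It is not the CuTe DSL kernel and says nothing about its performance; it only shows that applying the activation as each output row is produced gives the same result as materializing the full GEMM output first. The function names and the layout (gate in the first half of each output row, up projection in the second half) are assumptions for illustration.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(row):
    """SwiGLU on one row; assumes [gate | up] halves (layout is an assumption)."""
    h = len(row) // 2
    return [silu(row[i]) * row[h + i] for i in range(h)]


def row_times_matrix(row, w):
    """Multiply a length-k row by a k x n matrix."""
    return [sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]


def grouped_gemm_then_swiglu(x, group_sizes, weights):
    """Unfused reference: build the whole GEMM output, then run the activation."""
    intermediate, start = [], 0
    for size, w in zip(group_sizes, weights):
        for row in x[start:start + size]:  # tokens routed to this expert
            intermediate.append(row_times_matrix(row, w))
        start += size
    return [swiglu(r) for r in intermediate]  # second pass over the data


def grouped_gemm_swiglu_fused(x, group_sizes, weights):
    """Fused reference: activate each row as soon as its accumulator is ready."""
    out, start = [], 0
    for size, w in zip(group_sizes, weights):
        for row in x[start:start + size]:
            out.append(swiglu(row_times_matrix(row, w)))  # no stored intermediate
        start += size
    return out


# Three tokens, two experts: the first two tokens go to expert 0, the last to expert 1.
x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
group_sizes = [2, 1]
weights = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    [[-0.1, 0.2, -0.3, 0.4], [0.5, -0.6, 0.7, -0.8]],
]
print(grouped_gemm_swiglu_fused(x, group_sizes, weights))
```

On a GPU the difference between the two functions is where the intermediate lives: the unfused path writes it to global memory and reads it back, while a fused epilogue keeps it in registers.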
In practice
- Integrate kernels via cuDNN Frontend (v1.23.0+).
- Use Transformer Engine (v2.15+) for fused operations.
- Enable features through Megatron-Core (26.04-alpha.rc2+).
Topics
- Mixture-of-Experts
- GPU Optimization
- Kernel Fusion
- NVIDIA CUDA
- Deep Learning Training
- Transformer Engine
Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.