Boosting MoE Training Throughput with Advanced Fusion Kernels

2026-06-15 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

NVIDIA has introduced fused Multi-Layer Perceptron (MLP) kernels, written in the CuTe DSL, that raise the training throughput of Mixture-of-Experts (MoE) models. The kernels deliver a 1.3x–2x kernel-level speedup over unfused paths by removing memory and synchronization bottlenecks, and they enable sync-free MoE execution so that a full training iteration can be captured in NVIDIA CUDA graphs. End to end, this yields an 8% gain in the DeepSeek-V3 pre-training setup and a 93% gain for GPT-OSS pre-training. The kernels address activation bottlenecks, CPU boundedness, and quantization costs by fusing GroupGemm with activation functions (SwiGLU, GeGLU, sReLU) and quantization steps. They are available in the NVIDIA cuDNN Frontend, NVIDIA Transformer Engine, and NVIDIA Megatron-Core.

Key takeaway

AI engineers optimizing large-scale Mixture-of-Experts (MoE) training should integrate NVIDIA's fused MLP kernels to reduce training time and improve hardware utilization. Adopting them through cuDNN Frontend, Transformer Engine, or Megatron-Core can yield an end-to-end speedup of up to 93% (the GPT-OSS pre-training result), which bears directly on project timelines and compute costs. Updating the software stack to the supporting versions is the first step to capturing these gains.

Key insights

Fusing MoE block operations into custom kernels raises training throughput by removing memory-traffic and CPU-synchronization bottlenecks.

Principles

Fusing operations reduces memory I/O and maximizes utilization.
Hardware-aware software codesign is critical for throughput.
Eliminating CPU synchronization improves GPU utilization.
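
To make the first principle concrete, the sketch below counts the bytes of intermediate activations that cross GPU memory after the up-projection GEMM of one expert MLP, with and without fusion. It is a simplified accounting model, not a measurement: the function name, the tensor sizes, the 2-byte activation and 1-byte quantized widths, and the assumption that the fused path writes only the quantized output are all illustrative choices that do not come from the original article.

```python
def activation_traffic_bytes(tokens, hidden, fused, act_bytes=2, quant_bytes=1):
    """Estimate memory traffic for intermediate activations (hypothetical model).

    Weight and input reads are identical in both paths, so they are omitted.
    """
    gemm_out = tokens * 2 * hidden * act_bytes  # gate and up halves of the GEMM output
    act_out = tokens * hidden * act_bytes       # result of the gated activation
    quant_out = tokens * hidden * quant_bytes   # quantized result

    if fused:
        # Assumption: activation and quantization run in the GEMM epilogue,
        # so only the quantized tensor is written out.
        return quant_out

    # Unfused: the GEMM writes its output, the activation kernel reads it and
    # writes its result, then the quantization kernel reads that and writes again.
    return gemm_out + (gemm_out + act_out) + (act_out + quant_out)


unfused = activation_traffic_bytes(4096, 2048, fused=False)
fused = activation_traffic_bytes(4096, 2048, fused=True)
ratio = unfused / fused
```

Under these assumptions the unfused path moves 13 times as many intermediate bytes as the fused one. That ratio describes traffic in this toy model only; it says nothing about wall-clock time, and the kernel-level speedup reported in the summary above is 1.3x–2x.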

Method

The MoE block is re-designed using custom CuTe DSL kernels, fusing GroupGemm with activation functions (SwiGLU, GeGLU, sReLU) and quantization/transpose steps to create sync-free MoE execution.
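
The semantics of fusing a grouped GEMM with SwiGLU can be illustrated with a small pure-Python reference. This is a sketch of what the computation produces, not of the CuTe DSL kernels themselves: all function names are hypothetical, and the layout in which the first half of each output row is the gate and the second half is the up projection is an assumption, since the article summary does not specify it. Quantization and transpose are left out.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(row, weight):
    # row: [K], weight: [K][N] -> [N]
    return [sum(r * weight[k][n] for k, r in enumerate(row))
            for n in range(len(weight[0]))]


def grouped_gemm(tokens, group_sizes, weights):
    # Tokens are sorted by expert; group_sizes[i] rows use weights[i].
    out, start = [], 0
    for size, w in zip(group_sizes, weights):
        out.extend(matvec(row, w) for row in tokens[start:start + size])
        start += size
    return out


def swiglu(rows):
    # Assumed layout: first half of each row is the gate, second half is up.
    result = []
    for row in rows:
        half = len(row) // 2
        result.append([silu(g) * u for g, u in zip(row[:half], row[half:])])
    return result


def unfused_mlp_up(tokens, group_sizes, weights):
    # Two passes: the full GEMM output is materialized, then activated.
    return swiglu(grouped_gemm(tokens, group_sizes, weights))


def fused_mlp_up(tokens, group_sizes, weights):
    # One pass: each accumulator row is activated before it is stored,
    # so the 2N-wide intermediate is never kept.
    out, start = [], 0
    for size, w in zip(group_sizes, weights):
        half = len(w[0]) // 2
        for row in tokens[start:start + size]:
            acc = matvec(row, w)
            out.append([silu(acc[j]) * acc[half + j] for j in range(half)])
        start += size
    return out


tokens = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
group_sizes = [2, 1]  # first two tokens go to expert 0, the last to expert 1
weights = [
    [[1.0, 0.0, 2.0, 3.0], [0.0, 1.0, 1.0, -1.0]],
    [[0.5, -0.5, 1.0, 1.0], [1.0, 1.0, 0.0, 2.0]],
]
reference = unfused_mlp_up(tokens, group_sizes, weights)
fused_result = fused_mlp_up(tokens, group_sizes, weights)
```

Both paths return the same values; the difference is that the fused version never stores the full-width GEMM output, which is the intermediate a fused kernel avoids writing to memory.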

In practice

Integrate kernels via cuDNN Frontend (v1.23.0+).
Use Transformer Engine (v2.15+) for fused operations.
Enable features through Megatron-Core (26.04-alpha.rc2+).
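
A quick way to check an environment against the minimum versions listed above is sketched here. The distribution names passed to `importlib.metadata` are assumptions about how these packages are published, not something the article states, so adjust them to match the actual installation. The comparison looks only at the leading numeric components and ignores pre-release suffixes such as `-alpha.rc2`, which makes it a coarse check rather than a full version parser.

```python
import re
from importlib import metadata

# Minimum versions from the list above. The distribution names are assumed.
MINIMUMS = {
    "nvidia-cudnn-frontend": "1.23.0",
    "transformer_engine": "2.15",
    "megatron-core": "26.04-alpha.rc2",
}


def numeric_prefix(version):
    # "26.04-alpha.rc2" -> (26, 4); anything after the leading digits is ignored.
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def meets_minimum(installed, minimum):
    a, b = numeric_prefix(installed), numeric_prefix(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "2.15" and "2.15.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(distribution):
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def report():
    # Returns {distribution: True / False / None}, where None means not installed.
    status = {}
    for name, minimum in MINIMUMS.items():
        found = installed_version(name)
        status[name] = None if found is None else meets_minimum(found, minimum)
    return status
```

Calling `report()` returns a per-package status without raising when a package is missing, which keeps the check usable in environments where only part of the stack is installed.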

Topics

Mixture-of-Experts
GPU Optimization
Kernel Fusion
NVIDIA CUDA
Deep Learning Training
Transformer Engine

Code references

NVIDIA/cudnn-frontend

Best for: NLP Engineer, Research Scientist, Machine Learning Engineer, AI Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.