Better MoE model inference with warp decode
Summary
A new inference approach called "warp decode" improves both the speed and the numerical accuracy of Mixture-of-Experts (MoE) model inference, particularly for small-batch decode on Blackwell GPUs. The method organizes the kernel around outputs rather than experts, flipping the traditional parallelism axis. Compared with a conventional MoE path, warp decode delivers 1.84x higher throughput and produces outputs 1.4x closer to a full FP32 reference. It eliminates five "bookkeeping" steps, including padding, scattering, and combining, and removes two intermediate memory buffers by folding expert contributions into register accumulators. Because each warp works independently, the GPU can schedule warps more freely and hide latency better. The benefit is largest for autoregressive decode steps, where there is little shared work per expert.
Key takeaway
For AI engineers optimizing MoE model inference on Blackwell GPUs, warp decode is a strong option for small-batch decode. The reported gains are a 1.84x throughput improvement and outputs 1.4x closer to a full FP32 reference, both obtained by organizing parallelism around outputs rather than experts. The change streamlines the computation pipeline and reduces memory overhead, which can shorten development and deployment cycles for applications like Composer.
Key insights
Reorganizing MoE inference around outputs instead of experts dramatically boosts small-batch decode performance and accuracy.
Principles
- Warp independence improves GPU scheduling.
- Eliminating data staging reduces overhead.
- Maintaining FP32 accumulators enhances accuracy.
Method
Warp decode assigns each GPU warp to a single output value. The warp streams the weight data it needs, accumulates the contributions of all routed experts in registers, and writes one result. The whole computation is compressed into two kernels.
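The difference between the two layouts can be sketched in plain Python. This is a toy model, not the article's kernel: each expert is reduced to a single weight matrix, an ordinary loop iteration stands in for a warp, and a local variable stands in for a register accumulator. The function names, shapes, and values are all illustrative assumptions.

```python
# Toy MoE layer for one decode token (pure Python, no GPU).
# Assumption: each expert is a single weight matrix, weights[e][o] is the
# weight row that produces output value o of expert e.

def expert_centric(weights, hidden, routed, gates):
    """Conventional layout: one pass per expert, a staged buffer, then a combine."""
    n_out = len(weights[0])
    staged = []  # intermediate buffer: one full output vector per routed expert
    for e in routed:
        staged.append([sum(w * h for w, h in zip(weights[e][o], hidden))
                       for o in range(n_out)])
    # Separate combine step: gate-weighted sum of the staged expert outputs.
    return [sum(g * vec[o] for g, vec in zip(gates, staged))
            for o in range(n_out)]


def output_centric(weights, hidden, routed, gates):
    """Output-centric layout: one independent worker per output value."""
    n_out = len(weights[0])
    out = []
    for o in range(n_out):           # each iteration stands in for one warp
        acc = 0.0                    # stands in for a register accumulator
        for e, g in zip(routed, gates):
            row = weights[e][o]      # stream this expert's weight row
            acc += g * sum(w * h for w, h in zip(row, hidden))
        out.append(acc)              # exactly one write per output value
    return out


# Hypothetical example: 3 experts, 2 inputs, 2 outputs, top-2 routing.
weights = [
    [[1, 2], [3, 4]],   # expert 0
    [[0, 1], [1, 0]],   # expert 1 (not routed for this token)
    [[2, 2], [1, 1]],   # expert 2
]
hidden = [1.0, 2.0]
routed = [0, 2]
gates = [0.5, 0.5]

result_expert = expert_centric(weights, hidden, routed, gates)
result_output = output_centric(weights, hidden, routed, gates)
print(result_output)  # [5.5, 7.0]
```

Both functions compute the same values. The output-centric version never materializes the `staged` buffer and needs no separate combine pass, which is the kind of bookkeeping the summary describes as eliminated.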
In practice
- Use warp decode for MoE small-batch inference.
- Prioritize output-centric parallelism for decode.
- Avoid intermediate activation quantization.
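The effect of quantizing intermediates can be seen in a small, deliberately contrived example. It is a sketch under stated assumptions, not a reproduction of the article's 1.4x measurement: the values are chosen so that FP16 rounding visibly loses information, and the helper `round_to` is a hypothetical name that emulates reduced precision with Python's `struct` module.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to a struct format's precision ('e' = FP16, 'f' = FP32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


# Contrived values: each routed expert contributes 1 + 2**-12, which FP32
# represents exactly but FP16 cannot (FP16 spacing near 1.0 is 2**-10).
expert_outputs = [1.0 + 2.0 ** -12] * 4
gates = [0.25] * 4

# Reference result in Python's native float64.
reference = sum(g * y for g, y in zip(gates, expert_outputs))

# Path A: quantize each expert's output to FP16 before combining.
staged = [round_to("e", y) for y in expert_outputs]
quantized = sum(g * y for g, y in zip(gates, staged))

# Path B: keep one running accumulator at FP32 precision throughout.
acc = 0.0
for g, y in zip(gates, expert_outputs):
    acc = round_to("f", acc + g * y)

err_quantized = abs(quantized - reference)
err_fp32 = abs(acc - reference)
print(err_quantized, err_fp32)  # 0.000244140625 0.0
```

With these inputs the FP16-staged path loses the entire 2**-12 term, while the FP32 accumulator reproduces the reference exactly. Real activations will not behave this cleanly, but the direction of the effect matches the principle of keeping accumulators in FP32 and avoiding quantized intermediates.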
Topics
- MoE Model Inference
- Warp Decode
- Blackwell GPUs
- GPU Parallelism
- Throughput Optimization
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Cursor Blog.