Better MoE model inference with warp decode

2026-04-06 · Source: Cursor Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

A new inference approach called "warp decode" improves both the speed and the numerical accuracy of Mixture-of-Experts (MoE) model inference, particularly for small-batch decode on Blackwell GPUs. The method organizes the kernel around outputs rather than experts, flipping the traditional parallelism axis. Warp decode achieves a 1.84x throughput improvement and produces outputs 1.4x closer to a full FP32 reference than the conventional MoE path. It eliminates five "bookkeeping" steps, including padding, scattering, and combining, and removes two intermediate memory buffers by folding expert contributions into register accumulators. Because each warp works independently, the GPU can schedule warps more freely and hide latency better. The design is especially beneficial for autoregressive decode steps, where there is little shared work per expert.

Key takeaway

For AI engineers optimizing MoE model inference on Blackwell GPUs, warp decode is worth evaluating for small-batch decode. The reported results show a 1.84x throughput improvement and outputs 1.4x closer to a full FP32 reference, obtained by organizing parallelism around outputs rather than experts. This change streamlines the computation pipeline, reduces memory overhead, and can shorten model development and deployment cycles for applications like Composer.

Key insights

Reorganizing MoE inference around outputs instead of experts improves small-batch decode throughput (1.84x reported) and keeps outputs closer to a full FP32 reference (1.4x reported).

Principles

Warp independence improves GPU scheduling.
Eliminating data staging reduces overhead.
Maintaining FP32 accumulators enhances accuracy.

Method

Warp decode assigns each GPU warp to a single output value. The warp streams the weight data it needs, accumulates the contributions of all routed experts in registers, and writes one result. The whole computation is compressed into two kernels.
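
The method above can be illustrated with a small CPU sketch. This is not the article's kernel: it models a single projection in plain Python, and the function names, toy sizes, and gate weights are all hypothetical. It contrasts an expert-centric path, which stages one output vector per routed expert before combining, with an output-centric path, where each iteration of the outer loop stands in for one warp that owns one output value.

```python
import random

def expert_centric(x, experts, routed, gates):
    """Conventional shape: compute every routed expert's full output vector,
    stage it in an intermediate buffer, then combine with the gate weights."""
    staged = []  # intermediate buffer: one output vector per routed expert
    for e in routed:
        staged.append([sum(w * xi for w, xi in zip(row, x)) for row in experts[e]])
    n_out = len(staged[0])
    return [sum(g * y[j] for g, y in zip(gates, staged)) for j in range(n_out)]

def output_centric(x, experts, routed, gates):
    """Warp-decode-style shape: each iteration over j plays the role of one
    warp that owns a single output value (an illustrative assumption)."""
    n_out = len(experts[routed[0]])
    out = []
    for j in range(n_out):
        acc = 0.0  # stands in for the register accumulator
        for e, g in zip(routed, gates):
            row = experts[e][j]  # stream only the weight row this output needs
            acc += g * sum(w * xi for w, xi in zip(row, x))
        out.append(acc)  # one write per output, no staging buffer
    return out

# Hypothetical toy problem: 6 experts, 2 of them routed for this token.
random.seed(0)
hidden, n_out, n_experts = 8, 4, 6
x = [random.uniform(-1, 1) for _ in range(hidden)]
experts = [[[random.uniform(-1, 1) for _ in range(hidden)]
            for _ in range(n_out)] for _ in range(n_experts)]
routed, gates = [1, 4], [0.7, 0.3]

ref = expert_centric(x, experts, routed, gates)
out = output_centric(x, experts, routed, gates)
```

Both functions compute the same values; the difference is which loop is outermost and whether per-expert results are ever materialized. On a GPU the outer loop over `j` would be spread across warps, which is where the independence described in the summary comes from.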

In practice

Use warp decode for MoE small-batch inference.
Prioritize output-centric parallelism for decode.
Avoid intermediate activation quantization.
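
The advice to avoid intermediate activation quantization can be made concrete with a small numerical experiment. The sketch below is an assumption-laden illustration, not a measurement of the article's kernels: it emulates bfloat16 and FP32 rounding with the standard `struct` module, uses random per-expert contributions, and the helper names (`to_bf16`, `to_f32`, `combine`) are hypothetical. It compares rounding each expert's contribution to bfloat16 before the combine against keeping contributions in an FP32 accumulator.

```python
import random
import struct

def to_f32(v):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def to_bf16(v):
    """Round to the nearest bfloat16 value (round-to-nearest-even),
    emulated by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Hypothetical setup: 8 routed experts contributing to 64 output values.
random.seed(0)
n_routed, n_out = 8, 64
gates = [1.0 / n_routed] * n_routed
partials = [[random.uniform(-1, 1) for _ in range(n_out)]
            for _ in range(n_routed)]

def combine(round_partial):
    """Accumulate gate-weighted contributions in an emulated FP32 register,
    applying round_partial to each contribution first."""
    out = []
    for j in range(n_out):
        acc = 0.0
        for e in range(n_routed):
            acc = to_f32(acc + gates[e] * round_partial(partials[e][j]))
        out.append(acc)
    return out

# Double-precision reference for measuring error.
reference = [sum(gates[e] * partials[e][j] for e in range(n_routed))
             for j in range(n_out)]

staged = combine(to_bf16)  # contributions quantized to bf16 before combining
fused = combine(to_f32)    # contributions stay in FP32 until the final write

def mean_abs_err(values):
    return sum(abs(a - b) for a, b in zip(values, reference)) / n_out

err_staged = mean_abs_err(staged)
err_fused = mean_abs_err(fused)
```

In this toy setup `err_fused` is smaller than `err_staged`, because bfloat16 keeps only 8 significant bits while FP32 keeps 24. The size of the gap in a real model depends on the data and the kernel, so the figures reported in the summary should be taken from the original article rather than from this sketch.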

Topics

MoE Model Inference
Warp Decode
Blackwell GPUs
GPU Parallelism
Throughput Optimization

Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Hardware Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Cursor Blog.