Dropless MoE Training in JAX with Primus-Turbo

2026-06-10 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

AMD's Primus-Turbo library targets the efficiency trade-offs of Mixture-of-Experts (MoE) training with JAX/MaxText on AMD Instinct GPUs. Stock JAX/MaxText either drops tokens to keep shapes fixed ("dense_matmul") or runs dropless with "ragged_dot" and "ragged_all_to_all" and hits memory limits. Primus-Turbo adds two Composable Kernel (CK)-backed primitives: a grouped GEMM for ragged expert matmuls and a DeepEP dispatch/combine for token-aware routing. Both are exposed as first-class JAX ops via XLA FFI, enabling dropless MoE training. In experiments on DeepSeek-V3 671B with 64 AMD MI355X GPUs, the "sparse-gmm-deepep" path reaches 1179.7 TGS (tokens per GPU per second) at pdbs=8 (per-device batch size), outperforming the other dropless options. At 2000 steps on C4 it reaches a loss of 5.003 versus 5.163 for "dense-cf1.25", a 0.16-nat improvement, despite a ~19% lower step rate on real data.
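To make the token-dropping trade-off concrete, here is a minimal pure-Python sketch. It assumes that the "cf1.25" in "dense-cf1.25" denotes a capacity factor of 1.25, and it uses the common capacity formula ceil(capacity_factor * tokens * top_k / num_experts). The function names and the toy routing are illustrative, not taken from MaxText or Primus-Turbo.

```python
import math
from collections import Counter

def expert_capacity(num_tokens, top_k, num_experts, capacity_factor):
    # Fixed number of slots per expert (assumed formula, see lead-in).
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def dropped_per_expert(assignments, num_experts, capacity):
    # Tokens beyond an expert's capacity are dropped in a fixed-shape path.
    load = Counter(assignments)
    return [max(0, load.get(e, 0) - capacity) for e in range(num_experts)]

# 16 tokens, top_k = 1, 4 experts, routing skewed toward expert 0.
assignments = [0] * 8 + [1] * 4 + [2] * 3 + [3] * 1
capacity = expert_capacity(16, 1, 4, 1.25)
dropped = dropped_per_expert(assignments, 4, capacity)

print("capacity per expert:", capacity)
print("dropped per expert: ", dropped)
print("tokens kept (capacity-limited):", len(assignments) - sum(dropped))
print("tokens kept (dropless):        ", len(assignments))
```

With this skewed routing the capacity-limited path discards part of expert 0's load, while a dropless path processes every assignment at the cost of variable-sized (ragged) expert groups.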

Key takeaway

AI scientists and MLOps engineers training large Mixture-of-Experts models on AMD Instinct GPUs with JAX/MaxText should enable Primus-Turbo's "use_turbo_grouped_gemm" and "use_turbo_deepep_dispatch" flags. Doing so enables dropless training with better convergence (a 0.16-nat lower loss on C4 at 2000 steps), higher throughput than the other dropless methods, and room for larger batch sizes. The trade-off is a routing-imbalance cost on real data, reported as a ~19% lower step rate.

Key insights

Primus-Turbo enables efficient, dropless Mixture-of-Experts training in JAX on AMD GPUs by integrating custom grouped GEMM and DeepEP kernels.

Principles

Dropless MoE training yields superior convergence quality.
Custom kernel integration via JAX FFI bypasses memory limitations.
Pessimistic allocation ensures sync-free communication.
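The third principle is stated without detail. One plausible reading, sketched below in plain Python, is that receive buffers are sized for the worst-case routing so that their shape never depends on the actual token counts, which removes the need for a host-side synchronization to learn sizes before communicating. The bound and the helper names are assumptions for illustration, not Primus-Turbo's actual allocation rule.

```python
def worst_case_recv_rows(num_ranks, tokens_per_rank, top_k, experts_per_rank):
    # Assumed bound: every peer sends all of its tokens here, and each token
    # can reach at most min(top_k, experts_per_rank) experts on this rank.
    copies = min(top_k, experts_per_rank)
    return num_ranks * tokens_per_rank * copies

def pack_into_static_buffer(rows, capacity, pad=None):
    # The buffer length is always `capacity`; only the valid count varies.
    if len(rows) > capacity:
        raise ValueError("routing exceeded the worst-case bound")
    return rows + [pad] * (capacity - len(rows)), len(rows)

capacity = worst_case_recv_rows(
    num_ranks=4, tokens_per_rank=8, top_k=2, experts_per_rank=2
)
light, n_light = pack_into_static_buffer(list(range(10)), capacity)
heavy, n_heavy = pack_into_static_buffer(list(range(60)), capacity)

print(capacity, len(light), n_light, len(heavy), n_heavy)
```

Both the lightly and heavily loaded steps produce a buffer of the same length, so the shape seen by the compiler is static. The cost of this design is memory reserved for rows that are usually padding.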

Method

Primus-Turbo integrates CK-backed grouped GEMM and DeepEP dispatch/combine kernels into JAX via XLA FFI. It registers an abstract-evaluation rule, a "custom_vjp" for autodiff, and sharding rules, and uses a "setup()" call to freeze runtime communication.
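As a reference for what a grouped GEMM and its autodiff rule have to compute, the following pure-Python sketch spells out the forward and backward math on nested lists. It is not the CK kernel or the Primus-Turbo API. The assumptions are that rows of x are already sorted by expert and that group_sizes[e] gives the number of rows owned by expert e. In JAX, a forward/backward pair of this kind is what one would attach with jax.custom_vjp and its defvjp method.

```python
def group_bounds(group_sizes):
    # Yield (start, stop) row ranges, one per expert, in order.
    start = 0
    for size in group_sizes:
        yield start, start + size
        start += size

def grouped_gemm_fwd(x, w, group_sizes):
    # y[t] = x[t] @ w[e] for every row t in expert e's group.
    y = []
    for e, (lo, hi) in enumerate(group_bounds(group_sizes)):
        k, n = len(w[e]), len(w[e][0])
        for t in range(lo, hi):
            y.append([sum(x[t][i] * w[e][i][j] for i in range(k)) for j in range(n)])
    return y

def grouped_gemm_bwd(x, w, group_sizes, dy):
    # dx[t] = dy[t] @ w[e]^T and dw[e] = x_e^T @ dy_e (zeros for empty groups).
    dx, dw = [], []
    for e, (lo, hi) in enumerate(group_bounds(group_sizes)):
        k, n = len(w[e]), len(w[e][0])
        for t in range(lo, hi):
            dx.append([sum(dy[t][j] * w[e][i][j] for j in range(n)) for i in range(k)])
        dw.append([[sum(x[t][i] * dy[t][j] for t in range(lo, hi)) for j in range(n)]
                   for i in range(k)])
    return dx, dw

x = [[1, 2], [3, 4], [5, 6]]            # 3 tokens, hidden size 2
w = [[[1, 0], [0, 1]],                  # expert 0: identity
     [[2, 0], [0, 2]]]                  # expert 1: 2 * identity
group_sizes = [2, 1]                    # first two rows -> expert 0, last -> expert 1

y = grouped_gemm_fwd(x, w, group_sizes)
dx, dw = grouped_gemm_bwd(x, w, group_sizes, dy=[[1, 1]] * 3)
print(y, dx, dw)
```

The per-expert loop is the part a fused grouped GEMM kernel replaces: one launch covers all groups, with group_sizes supplied at run time instead of padding every expert to a fixed capacity.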

In practice

Use "use_turbo_grouped_gemm=true" for ragged expert matmuls.
Enable "use_turbo_deepep_dispatch=true" for efficient EP all-to-all.
Implement "_ensure_deepep_setup" for once-per-process initialization.

Topics

Mixture-of-Experts
JAX
AMD Instinct GPUs
Primus-Turbo
Grouped GEMM
DeepEP
MaxText

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.