Dropless MoE Training in JAX with Primus-Turbo
Summary
AMD's Primus-Turbo library targets the efficiency trade-offs of Mixture-of-Experts (MoE) training in JAX/MaxText on AMD Instinct GPUs. Stock JAX/MaxText either drops tokens to keep shapes fixed ("dense_matmul") or runs dropless with "ragged_dot" and "ragged_all_to_all" and hits a memory wall. Primus-Turbo adds two Composable Kernel (CK)-backed primitives: a grouped GEMM for ragged expert matmuls and a DeepEP dispatch/combine for token-aware routing. Both are exposed as first-class JAX ops through the XLA FFI, which enables dropless MoE training. In experiments on DeepSeek-V3 671B with 64 AMD MI355X GPUs, the "sparse-gmm-deepep" path reaches 1179.7 TGS (tokens per GPU per second) at pdbs=8 (per-device batch size 8), outperforming the other dropless options. On C4 at 2000 steps it reaches a loss of 5.003 versus 5.163 for "dense-cf1.25", a 0.16-nat improvement, despite a ~19% lower step rate on real data.
Key takeaway
If you train large Mixture-of-Experts models on AMD Instinct GPUs with JAX/MaxText, enable Primus-Turbo's "use_turbo_grouped_gemm" and "use_turbo_deepep_dispatch" flags. They give you dropless training with better convergence (0.16 nats lower loss on C4 than "dense-cf1.25" at 2000 steps) and higher throughput than the other dropless methods, while fitting larger batch sizes. Because no routed token is discarded, training stays faithful to the router's decisions, and the convergence gain holds even after paying the routing-imbalance cost on real data.
Key insights
Primus-Turbo enables efficient, dropless Mixture-of-Experts training in JAX on AMD GPUs by integrating custom grouped GEMM and DeepEP kernels.
Principles
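- Dropless MoE training converges better than capacity-limited routing, because every routed token reaches its expert.
- Custom kernels integrated through the XLA FFI avoid the memory wall of "ragged_dot" and "ragged_all_to_all".
- Pessimistic (worst-case) buffer allocation keeps communication sync-free.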
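To make the first principle concrete, here is a minimal sketch of what a capacity factor does to imbalanced routing. The routing pattern, the helper names `route_with_capacity` and `dropless_group_sizes`, and the 16-token batch are hypothetical and chosen only for illustration; the 1.25 factor mirrors the "dense-cf1.25" baseline named in the summary.

```python
import math
from collections import Counter


def route_with_capacity(expert_ids, num_experts, capacity_factor):
    """Count kept and dropped token assignments under a fixed per-expert capacity."""
    # Fixed-shape routing gives every expert the same number of slots.
    capacity = math.ceil(capacity_factor * len(expert_ids) / num_experts)
    load = Counter(expert_ids)
    # Tokens beyond an expert's capacity are discarded.
    kept = sum(min(load[e], capacity) for e in range(num_experts))
    return capacity, kept, len(expert_ids) - kept


def dropless_group_sizes(expert_ids, num_experts):
    """Dropless routing sizes each expert's group by its actual load."""
    load = Counter(expert_ids)
    return [load[e] for e in range(num_experts)]


# Hypothetical imbalanced routing: 16 tokens, 4 experts, expert 0 is "hot".
expert_ids = [0] * 8 + [1] * 4 + [2] * 3 + [3] * 1

capacity, kept, dropped = route_with_capacity(expert_ids, num_experts=4, capacity_factor=1.25)
group_sizes = dropless_group_sizes(expert_ids, num_experts=4)

print(capacity, kept, dropped)        # 5 13 3
print(group_sizes, sum(group_sizes))  # [8, 4, 3, 1] 16
```

With a capacity of 5 slots per expert, the hot expert loses 3 of its 8 tokens, while the dropless layout keeps all 16 at the price of ragged group sizes. Those ragged groups are what a grouped GEMM is meant to consume.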
Method
Primus-Turbo integrates CK-backed grouped GEMM and DeepEP dispatch/combine kernels into JAX via the XLA FFI. For each op it registers an abstract evaluation rule, a "custom_vjp" for autodiff, and sharding rules, and it relies on a "setup()" call to freeze runtime communication.
In practice
- Set "use_turbo_grouped_gemm=true" for ragged expert matmuls.
- Set "use_turbo_deepep_dispatch=true" for efficient expert-parallel (EP) all-to-all.
- Implement "_ensure_deepep_setup" so that DeepEP initialization runs once per process.
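The once-per-process initialization in the last item follows a common guard pattern, sketched below. The source names "_ensure_deepep_setup" and a "setup()" call but does not show their bodies, so `ensure_setup_once` and its `setup_fn` argument are hypothetical stand-ins rather than the library's API.

```python
import threading

_setup_lock = threading.Lock()
_setup_done = False


def ensure_setup_once(setup_fn):
    """Run setup_fn at most once per process; return True only on the call that ran it."""
    global _setup_done
    if _setup_done:  # fast path once initialization has happened
        return False
    with _setup_lock:
        if _setup_done:  # another thread finished setup while we waited
            return False
        setup_fn()
        _setup_done = True
        return True


# Stand-in for the real setup call: record each time it actually runs.
calls = []
first = ensure_setup_once(lambda: calls.append("setup"))
second = ensure_setup_once(lambda: calls.append("setup"))
print(first, second, calls)  # True False ['setup']
```

Setting the flag only after `setup_fn` returns means a failed setup is retried on the next call instead of being silently marked as done.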
Topics
- Mixture-of-Experts
- JAX
- AMD Instinct GPUs
- Primus-Turbo
- Grouped GEMM
- DeepEP
- MaxText
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.