Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]
Summary
A new monokernel for LLM inference on AMD MI300X GPUs reaches up to 3,300 output tokens/s per request on 8x MI300X at batch size 1, with no speculative decoding and no quantization. The kernel maps memory access patterns directly onto the physical die topology, specifically the I/O dies (IODs), so that the hardware operates at full design performance. It currently supports a small 2B coding model, and the developers plan to extend support to large frontier Mixture-of-Experts (MoE) models. The approach builds on the MI300X's own topology rather than treating it as a generic CUDA box, and it relies on wave-uniform branching to keep the cost of control flow inside the monokernel low.
Key takeaway
For AI engineers optimizing LLM inference latency on AMD MI300X hardware, this work shows that a monokernel designed around the GPU's physical die topology can yield significant performance gains, reaching up to 3,300 output tokens/s per request. Investigate hardware-aware kernel design, especially when planning for large Mixture-of-Experts models, where Tensor Parallelism may be a more effective strategy than Expert Parallelism because it avoids routing imbalance.
Key insights
Optimizing LLM inference on AMD MI300X by leveraging hardware die topology with a single GPU-resident monokernel.
Principles
- Mapping memory access to physical die topology is crucial for performance
- Wave-uniform branching in monokernels does not significantly harm performance
- Compute throughput (FLOPs) is often more abundant than memory bandwidth, so memory access tends to be the bottleneck
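The compute-versus-bandwidth principle above can be made concrete with a rough roofline calculation. The sketch below is illustrative only: the bandwidth and throughput figures are assumed peak datasheet-style values rather than measurements, and the 16-bit weight size is an assumption based on the "no quantization" statement. The function names are hypothetical.

```python
# Back-of-the-envelope roofline numbers for batch-size-1 decode.
# Hardware figures are ASSUMED peak values, not measurements from the source.

HBM_BYTES_PER_S = 5.3e12   # assumed peak HBM bandwidth per MI300X
PEAK_FLOPS = 1.3e15        # assumed peak dense 16-bit throughput per MI300X
NUM_GPUS = 8               # from the summary: 8x MI300X
PARAMS = 2e9               # from the summary: 2B-parameter model
BYTES_PER_PARAM = 2        # assumption: unquantized 16-bit weights


def ridge_point(peak_flops: float, bytes_per_s: float) -> float:
    """FLOPs per byte a kernel needs before compute becomes the limit."""
    return peak_flops / bytes_per_s


def decode_intensity(bytes_per_param: float) -> float:
    """At batch size 1, each weight is read once for one multiply-add (2 FLOPs)."""
    return 2.0 / bytes_per_param


def bandwidth_bound_tokens_per_s(params: float, bytes_per_param: float,
                                 bytes_per_s: float, num_gpus: int) -> float:
    """Idealized ceiling: every token streams all weights once, sharded evenly."""
    weight_bytes = params * bytes_per_param
    return num_gpus * bytes_per_s / weight_bytes


if __name__ == "__main__":
    print("ridge point (FLOP/byte):", ridge_point(PEAK_FLOPS, HBM_BYTES_PER_S))
    print("decode intensity (FLOP/byte):", decode_intensity(BYTES_PER_PARAM))
    print("bandwidth ceiling (tokens/s):",
          bandwidth_bound_tokens_per_s(PARAMS, BYTES_PER_PARAM,
                                       HBM_BYTES_PER_S, NUM_GPUS))
```

Under these assumptions, decode performs about 1 FLOP per byte read while the hardware would need a few hundred FLOPs per byte to become compute-bound, which is why memory access patterns dominate. The ceiling ignores KV-cache reads, activations, and inter-GPU communication, so real throughput sits below it.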
Method
Build a monokernel that executes the full LLM decode sequence as one GPU-resident program, with memory access patterns aligned to the AMD MI300X's physical die topology and IODs.
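To see why running the whole decode sequence as one resident program is attractive at this speed, it helps to work out the time budget per token. The sketch below is only arithmetic: the 3,300 tokens/s figure comes from the summary, while the layer count and kernels-per-layer values are hypothetical placeholders, since the source does not describe the model's structure or a baseline launch count.

```python
# Per-token time budget at a given decode rate, and how thinly it would be
# spread if each stage were a separately launched kernel.
# num_layers and kernels_per_layer are HYPOTHETICAL illustration values.

def per_token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available for one output token, in microseconds."""
    return 1e6 / tokens_per_s


def host_launches_per_token(num_layers: int, kernels_per_layer: int) -> int:
    """Kernel launches per token if every stage is launched from the host."""
    return num_layers * kernels_per_layer


def budget_per_launch_us(tokens_per_s: float, num_layers: int,
                         kernels_per_layer: int) -> float:
    """Average time each separately launched kernel could take, launch included."""
    launches = host_launches_per_token(num_layers, kernels_per_layer)
    return per_token_budget_us(tokens_per_s) / launches


if __name__ == "__main__":
    print(per_token_budget_us(3300))             # budget for one token
    print(host_launches_per_token(24, 8))        # hypothetical multi-kernel design
    print(budget_per_launch_us(3300, 24, 8))     # average slice per launch
```

At 3,300 tokens/s the whole token has roughly 300 microseconds. With the placeholder counts used here, a multi-kernel design would leave well under 2 microseconds per launch on average, whereas a single resident program has no per-stage launch to pay for.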
In practice
- Achieve up to 3,300 output tokens/s per request on 8x MI300X at batch size 1, without speculative decoding or quantization
- Consider using Tensor Parallelism (TP) instead of Expert Parallelism (EP) for MoE models to mitigate routing imbalance
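The routing-imbalance point about TP versus EP can be illustrated with a toy load model. This is a minimal sketch, not the authors' method: the expert count, the contiguous expert-to-GPU placement, and the selected expert IDs are all hypothetical, and the model assumes the number of experts divides evenly across GPUs.

```python
from collections import Counter


def ep_load(active_experts, num_experts, num_gpus):
    """Expert Parallelism: each GPU owns a contiguous slice of whole experts,
    so work lands only on the GPUs that host the routed experts."""
    per_gpu = num_experts // num_gpus  # assumes an even split
    counts = Counter(e // per_gpu for e in active_experts)
    return [float(counts.get(g, 0)) for g in range(num_gpus)]


def tp_load(active_experts, num_gpus):
    """Tensor Parallelism: every expert is sharded across all GPUs,
    so each GPU does an equal slice of every routed expert."""
    share = len(active_experts) / num_gpus
    return [share] * num_gpus


def imbalance(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean


if __name__ == "__main__":
    routed = [3, 5, 17, 42]  # hypothetical top-4 routing for one token
    print(ep_load(routed, num_experts=64, num_gpus=8))
    print(imbalance(ep_load(routed, num_experts=64, num_gpus=8)))
    print(imbalance(tp_load(routed, num_gpus=8)))
```

In this toy case, EP leaves most GPUs idle while one handles two experts, and TP spreads the same work evenly regardless of which experts the router picks. The model ignores the extra communication that TP requires, which a real comparison would need to account for.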
Topics
- LLM Inference
- AMD MI300X
- Monokernel
- GPU Architecture
- Performance Optimization
- Mixture-of-Experts
Best for: NLP Engineer, AI Architect, MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.