Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]

Source: Machine Learning · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, quick

Summary

A new monokernel for LLM inference on AMD MI300X GPUs reaches up to 3,300 output tokens/s per request at batch size 1 across 8x MI300X, with no speculative decoding and no quantization. The kernel maps memory access patterns directly onto the physical die topology, specifically the IODs (I/O dies), so that the hardware can run at its full design performance. It currently supports a small 2B coding model, and the developers plan to extend it to large frontier Mixture-of-Experts (MoE) models. The approach exploits the MI300X topology rather than treating the GPU as a generic CUDA box, and it relies on wave-uniform branching (branches on which every lane of a wavefront takes the same path) to keep the performance cost of control flow small.
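To put the headline rate in perspective, the following back-of-envelope sketch converts it into a per-token time budget and compares that with a rough lower bound for streaming the weights from HBM. The tokens/s figure, model size, and GPU count come from the summary above; the 16-bit weight format, the even sharding across GPUs, the one-read-per-token model, and the 5.3 TB/s peak bandwidth per GPU are assumptions made for illustration only.

```python
# Back-of-envelope decode budget. Only the tokens/s rate, the 2B model size,
# and the 8-GPU count come from the summary; everything else is an assumption.

TOKENS_PER_S = 3_300            # reported peak at batch size 1
PARAMS = 2e9                    # "2B" model
NUM_GPUS = 8
BYTES_PER_PARAM = 2             # assumption: 16-bit weights (no quantization)
PEAK_HBM_BYTES_PER_S = 5.3e12   # assumption: datasheet peak bandwidth per GPU


def per_token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available for one decode step, in microseconds."""
    return 1e6 / tokens_per_s


def weight_stream_floor_us(params: float, bytes_per_param: float,
                           num_gpus: int, bytes_per_s: float) -> float:
    """Time for each GPU to read its weight shard once at peak bandwidth.

    Assumes weights are split evenly and every weight is read once per token.
    """
    bytes_per_gpu = params * bytes_per_param / num_gpus
    return bytes_per_gpu / bytes_per_s * 1e6


budget = per_token_budget_us(TOKENS_PER_S)
floor = weight_stream_floor_us(PARAMS, BYTES_PER_PARAM, NUM_GPUS,
                               PEAK_HBM_BYTES_PER_S)

print(f"per-token budget:       {budget:.0f} us")
print(f"weight-streaming floor: {floor:.0f} us ({floor / budget:.0%} of budget)")
```

Under these assumptions each token has roughly 303 µs, of which about 94 µs is an idealized weight-streaming floor. The remainder has to cover everything else in the step, such as inter-GPU communication and synchronization, which is why small inefficiencies in memory placement matter at this rate.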

Key takeaway

For AI engineers optimizing LLM inference latency on AMD MI300X hardware, this work shows that a monokernel designed around the GPU's physical die topology can yield significant gains, reaching up to 3,300 output tokens/s per request. Hardware-aware kernel design is worth investigating, especially when planning for large Mixture-of-Experts models, where Tensor Parallelism might be a more effective strategy than Expert Parallelism because it avoids routing imbalances.
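The routing-imbalance point can be illustrated with a toy model of one token passing through a single MoE layer at batch size 1. The expert count, the top-2 routing, the expert placement, and the selected expert IDs are all hypothetical; the sketch also ignores the communication cost that Tensor Parallelism adds, so it shows only the compute-balance side of the trade-off.

```python
# Toy model: one token (batch size 1) through a single MoE layer.
# All sizes and the routing decision are hypothetical illustration values.

NUM_GPUS = 8
NUM_EXPERTS = 64
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS


def expert_parallel_load(selected_experts):
    """Expert Parallelism: each expert lives whole on one GPU,
    so the hosting GPU does all of that expert's work."""
    load = [0.0] * NUM_GPUS
    for expert_id in selected_experts:
        load[expert_id // EXPERTS_PER_GPU] += 1.0
    return load


def tensor_parallel_load(selected_experts):
    """Tensor Parallelism: every expert is sliced across all GPUs,
    so each GPU does 1/NUM_GPUS of every selected expert."""
    share = len(selected_experts) / NUM_GPUS
    return [share] * NUM_GPUS


selected = [3, 42]  # hypothetical top-2 routing decision for one token
ep = expert_parallel_load(selected)
tp = tensor_parallel_load(selected)

# The step finishes when the busiest GPU finishes.
print("EP per-GPU load:", ep, "-> bounded by", max(ep))
print("TP per-GPU load:", tp, "-> bounded by", max(tp))
```

In this toy setup the total work is identical, but under Expert Parallelism two GPUs carry all of it while six sit idle, whereas Tensor Parallelism spreads it evenly. Whether that advantage survives the extra all-to-all or reduction traffic depends on the interconnect and is not modeled here.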

Key insights

Optimizing LLM inference on AMD MI300X by leveraging hardware die topology with a single GPU-resident monokernel.

Principles

Method

Build a monokernel that executes the full LLM decode sequence as one GPU-resident program, with memory access patterns aligned to the AMD MI300X's physical die topology and its IODs.
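One reason a single resident program can help at this token rate is that per-kernel launch overhead would otherwise be paid many times per token. The sketch below is a rough model of that effect, not a measurement: the number of launches per token and the per-launch overhead are hypothetical placeholders, and the only figure taken from the summary is the 3,300 tokens/s rate.

```python
# Toy model of launch overhead versus the per-token time budget.
# LAUNCHES_PER_TOKEN and LAUNCH_OVERHEAD_US are hypothetical placeholders.

TOKENS_PER_S = 3_300         # reported peak at batch size 1
LAUNCHES_PER_TOKEN = 100     # hypothetical: one launch per op in a decode step
LAUNCH_OVERHEAD_US = 5.0     # hypothetical: fixed cost per kernel launch


def launch_overhead_share(tokens_per_s: float, launches_per_token: int,
                          overhead_us: float) -> float:
    """Fraction of the per-token budget consumed by launch overhead alone."""
    budget_us = 1e6 / tokens_per_s
    return launches_per_token * overhead_us / budget_us


# Conventional pipeline: many launches per generated token.
multi_kernel = launch_overhead_share(TOKENS_PER_S, LAUNCHES_PER_TOKEN,
                                     LAUNCH_OVERHEAD_US)

# Resident program: launched once, so no launches are paid per token.
mono_kernel = launch_overhead_share(TOKENS_PER_S, 0, LAUNCH_OVERHEAD_US)

print(f"multi-kernel launch share of budget: {multi_kernel:.0%}")
print(f"monokernel launch share of budget:   {mono_kernel:.0%}")
```

With these placeholder values, launch overhead alone would exceed the whole per-token budget, which suggests why a resident design is attractive at this rate. Real pipelines can reduce launch cost in other ways as well, so the model is only a motivation for the design choice, not a description of how the monokernel is implemented.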

Best for: NLP Engineer, AI Architect, MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.