Building a monokernel for LLM inference on AMD MI300X - up to 3,300 output tokens/s per request [P]

2026-05-29 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

A new monokernel for LLM inference on AMD MI300X GPUs reaches up to 3,300 output tokens/s per request on 8x MI300X at batch size 1, with no speculative decoding and no quantization. It maps memory access patterns directly onto the physical die topology, specifically the I/O dies (IODs), so that the hardware operates at its full design performance. The kernel currently supports only a small 2B coding model; the developers plan to extend it to large frontier Mixture-of-Experts (MoE) models. The approach works with the MI300X topology rather than treating the GPU as a generic CUDA box, and it keeps branches wave-uniform so that branching inside the monokernel has little performance impact.

Key takeaway

For AI engineers optimizing LLM inference latency on AMD MI300X hardware, this work shows that a monokernel designed around the GPU's physical die topology can yield significant performance gains, reaching up to 3,300 output tokens/s per request. Investigate hardware-aware kernel design, especially when planning for large Mixture-of-Experts models, where Tensor Parallelism may be a more effective strategy than Expert Parallelism because it avoids routing imbalance.

Key insights

Optimizing LLM inference on AMD MI300X by leveraging hardware die topology with a single GPU-resident monokernel.

Principles

Mapping memory access to physical die topology is crucial for performance
Wave-uniform branching in monokernels does not significantly harm performance
FLOPs are often more abundant than memory bandwidth
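
The last principle can be checked with a back-of-envelope roofline estimate. The sketch below is illustrative only: the 2B parameter count, the 8 GPUs, and the lack of quantization come from the summary, while 16-bit weights, even sharding of weights across GPUs, and the use of vendor peak figures for MI300X (5.3 TB/s HBM bandwidth, about 1,307 dense FP16 TFLOP/s) are assumptions. The function name `decode_roofline` is hypothetical.

```python
# Back-of-envelope roofline for batch-size-1 decode (a sketch, not a benchmark).
PARAMS = 2e9              # from the summary: "2B" model
BYTES_PER_PARAM = 2       # assumption: FP16/BF16 weights (no quantization)
NUM_GPUS = 8              # from the summary: 8x MI300X
HBM_BYTES_PER_S = 5.3e12  # assumption: vendor peak HBM bandwidth per GPU
PEAK_FLOPS = 1307e12      # assumption: vendor peak dense FP16 FLOP/s per GPU


def decode_roofline(params, bytes_per_param, num_gpus, bandwidth, peak_flops):
    """Estimate per-token time bounds when every weight is read once per token."""
    bytes_per_gpu = params * bytes_per_param / num_gpus
    flops_per_gpu = 2 * params / num_gpus  # one multiply-add per weight
    t_mem = bytes_per_gpu / bandwidth      # time to stream the weights
    t_compute = flops_per_gpu / peak_flops # time to do the math
    return {
        "arithmetic_intensity": flops_per_gpu / bytes_per_gpu,  # FLOP per byte
        "machine_balance": peak_flops / bandwidth,              # FLOP per byte
        "t_mem_s": t_mem,
        "t_compute_s": t_compute,
        "max_tokens_per_s": 1.0 / max(t_mem, t_compute),
    }


r = decode_roofline(PARAMS, BYTES_PER_PARAM, NUM_GPUS, HBM_BYTES_PER_S, PEAK_FLOPS)
print(f"arithmetic intensity: {r['arithmetic_intensity']:.1f} FLOP/byte")
print(f"machine balance:      {r['machine_balance']:.0f} FLOP/byte")
print(f"bandwidth-bound ceiling: {r['max_tokens_per_s']:.0f} tokens/s")
```

Under these assumptions the workload needs about 1 FLOP per byte while the hardware offers a few hundred, so memory traffic sets the limit and compute is far from saturated. The ceiling ignores KV-cache reads, activations, and inter-GPU communication, so it is an upper bound rather than a prediction.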

Method

Building a monokernel that executes the full LLM decode sequence as one GPU-resident program, optimizing memory access patterns to align with the AMD MI300X's physical die topology and IODs.
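
A single resident program has to branch between decode stages internally, which is where wave-uniform branching matters. The toy model below is a sketch under stated assumptions, not the authors' kernel: it treats a wavefront as 64 lanes (the wavefront width on CDNA-class GPUs such as MI300X) and counts how many branch paths the wavefront must execute serially. The stage names and all function names are hypothetical.

```python
WAVE_SIZE = 64  # lanes per wavefront on CDNA-class GPUs

# Hypothetical decode stages that one resident program steps through.
STAGES = ["rmsnorm", "qkv_proj", "attention", "out_proj", "mlp"]


def serialized_paths(branch_keys):
    """A wavefront executes one pass per distinct branch target among its lanes."""
    return len(set(branch_keys))


def wave_uniform_keys(stage_id):
    # Every lane branches on the same value (the current stage id),
    # so the whole wavefront follows a single path.
    return [stage_id] * WAVE_SIZE


def lane_divergent_keys():
    # Lanes branch on lane-dependent data (here, lane parity),
    # so the wavefront must execute both paths with masking.
    return [lane % 2 for lane in range(WAVE_SIZE)]


def monokernel_cost(uniform):
    """Total serialized branch passes across all stages for one wavefront."""
    total = 0
    for stage_id in range(len(STAGES)):
        keys = wave_uniform_keys(stage_id) if uniform else lane_divergent_keys()
        total += serialized_paths(keys)
    return total


print("wave-uniform passes:  ", monokernel_cost(uniform=True))
print("lane-divergent passes:", monokernel_cost(uniform=False))
```

In this model a stage dispatch keyed on a per-wavefront value costs one pass per stage, the same as if each stage were its own kernel, while a lane-dependent branch doubles the passes. Real kernels have further costs (register pressure, instruction cache) that this sketch does not capture.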

In practice

Achieve up to 3,300 output tokens/s per request on 8x MI300X for LLM inference
Consider using Tensor Parallelism (TP) instead of Expert Parallelism (EP) for MoE models to mitigate routing imbalance
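
The routing-imbalance point can be illustrated with a small load model for a single token at batch size 1. This is a sketch under assumptions, not a description of the authors' design: the expert count, top-k value, and function names are hypothetical, experts are assumed to be placed contiguously on GPUs under Expert Parallelism, and communication cost is ignored.

```python
import random

NUM_EXPERTS, NUM_GPUS, TOP_K = 64, 8, 2  # hypothetical MoE configuration


def ep_gpu_load(chosen_experts, num_experts, num_gpus):
    """Expert Parallelism: each GPU owns a contiguous block of whole experts."""
    experts_per_gpu = num_experts // num_gpus
    load = [0.0] * num_gpus
    for expert in chosen_experts:
        load[expert // experts_per_gpu] += 1.0
    return load


def tp_gpu_load(chosen_experts, num_gpus):
    """Tensor Parallelism: every expert is sharded evenly across all GPUs."""
    share = len(chosen_experts) / num_gpus
    return [share] * num_gpus


def imbalance(load):
    """Ratio of the busiest GPU's work to the mean; 1.0 is perfectly balanced."""
    return max(load) / (sum(load) / len(load))


rng = random.Random(0)  # fixed seed so the run is repeatable
chosen = rng.sample(range(NUM_EXPERTS), TOP_K)  # experts picked by the router

ep = ep_gpu_load(chosen, NUM_EXPERTS, NUM_GPUS)
tp = tp_gpu_load(chosen, NUM_GPUS)
print("EP busy GPUs:", sum(1 for x in ep if x > 0), "of", NUM_GPUS)
print("EP imbalance:", imbalance(ep))
print("TP imbalance:", imbalance(tp))
```

With one token and top-k routing, Expert Parallelism can keep at most `TOP_K` GPUs busy regardless of which experts are chosen, while Tensor Parallelism spreads the same work over every GPU. The trade-off, not modeled here, is that Tensor Parallelism needs a reduction across GPUs for each sharded layer.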

Topics

LLM Inference
AMD MI300X
Monokernel
GPU Architecture
Performance Optimization
Mixture-of-Experts

Best for: NLP Engineer, AI Architect, MLOps Engineer, AI Engineer, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.