FlashInfer on ROCm: High‑Throughput Prefill Attention via AITER
Summary
FlashInfer on ROCm, released on April 6, 2026, is a high-performance kernel library that optimizes attention computation for large language model (LLM) inference on AMD Instinct GPUs. This release updates FlashInfer on ROCm from version 0.2.5 to 0.5.3 and introduces FlashAttention-2-based prefill kernels (single-request, batched, and ragged variants) for AMD's CDNA3 and CDNA4 architectures. The new kernels complement the previously ported decode kernels. Supported features include a paged KV cache for efficient memory management, and Grouped Query Attention (GQA) and Multi-Query Attention (MQA), which reduce KV cache requirements. The port required significant architectural changes: NVIDIA's warp matrix operations were replaced with CDNA3/CDNA4 Matrix Fused Multiply-Add (MFMA) instructions, and thread layouts were restructured around 64-thread wavefronts.
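To make the GQA/MQA memory saving concrete, here is a back-of-the-envelope sketch of KV cache size as a function of the number of KV heads. The model shape (32 layers, head dimension 128, 8192 tokens, batch of 8, fp16) is a hypothetical example chosen for illustration, not a configuration from the original article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to hold K and V for every layer, token, and KV head."""
    # The leading factor of 2 covers the separate K and V tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, used only for illustration (fp16 = 2 bytes).
shape = dict(num_layers=32, head_dim=128, seq_len=8192, batch_size=8)

mha = kv_cache_bytes(num_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # 4 query heads share a KV head
mqa = kv_cache_bytes(num_kv_heads=1, **shape)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.0f} GiB")
```

The cache shrinks in direct proportion to the number of KV heads, which is why GQA and MQA models can hold more concurrent requests in the same GPU memory.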
Key takeaway
For MLOps engineers deploying LLMs on AMD Instinct GPUs, FlashInfer on ROCm improves inference efficiency through optimized prefill and decode kernels. Integrate the library to improve throughput and memory utilization, especially for models that use GQA or MQA. The provided Docker images streamline setup, and the AITER backend is available for specific prefill operations.
Key insights
FlashInfer on ROCm optimizes LLM inference on AMD GPUs by specializing attention kernels for prefill and decode phases.
Principles
- Optimize attention for prefill (compute-intensive) and decode (memory-bound) phases.
- Use a paged KV cache for efficient memory management.
- Adapt kernel architecture to GPU-specific matrix operations.
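The compute-intensive versus memory-bound distinction in the first principle can be estimated with a rough arithmetic-intensity calculation (FLOPs per byte moved). This is a simplified sketch: it assumes a single head in fp16, ignores causal masking, and assumes Q, K, V, and the output are each moved to or from memory exactly once. The sequence lengths are illustrative.

```python
def attention_intensity(q_len, kv_len, head_dim, bytes_per_elem=2):
    """Approximate FLOPs per byte for one attention head."""
    # Q @ K^T and P @ V each cost about 2 * q_len * kv_len * head_dim FLOPs.
    flops = 4 * q_len * kv_len * head_dim
    # Elements moved: Q and the output (q_len rows each), K and V (kv_len rows each).
    elems = 2 * q_len * head_dim + 2 * kv_len * head_dim
    return flops / (elems * bytes_per_elem)


prefill = attention_intensity(q_len=4096, kv_len=4096, head_dim=128)
decode = attention_intensity(q_len=1, kv_len=4096, head_dim=128)

print(f"prefill: {prefill:.0f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
```

Under these assumptions, prefill performs thousands of FLOPs per byte, so matrix throughput dominates, while decode performs about one FLOP per byte, so its speed is set by how fast the KV cache can be read.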
Method
The port restructured four core computational stages: loading query matrices, streaming key/value data, computing query-key dot products, and performing the softmax-value multiplication. The central change was replacing NVIDIA's warp-level matrix operations (`wmma`) with CDNA3/CDNA4 MFMA instructions.
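The four stages can be illustrated with a pure-Python reference for a single query row, using the online-softmax recurrence that FlashAttention-style kernels rely on. This is a conceptual sketch only, not the ROCm kernel: the function names and the block size are assumptions made for illustration, and on the GPU the dot products and the weighted sum would be carried out by MFMA instructions across a wavefront rather than by Python loops.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: full softmax(q . K^T / sqrt(d)) @ V for one query row."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def streamed_attention_row(q, keys, values, block_size=2):
    """Same result, but K/V are consumed block by block."""
    scale = 1.0 / math.sqrt(len(q))       # stage 1: the query is loaded once
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]    # stage 2: stream a K/V block
        v_blk = values[start:start + block_size]
        # Stage 3: query-key dot products for this block.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        # Stage 4: rescale earlier partial results to the new maximum,
        # then accumulate this block's softmax-weighted values.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5, -0.25, 2.0]
keys = [[0.1 * (i + j) for j in range(4)] for i in range(5)]
values = [[float(i), float(i * i)] for i in range(5)]

reference = naive_attention_row(q, keys, values)
streamed = streamed_attention_row(q, keys, values, block_size=2)
print(reference, streamed)
```

Because the running maximum and sum are corrected after every block, the streamed version never needs the full row of scores in memory, which is what lets a kernel keep its working set in registers and on-chip memory.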
In practice
- Use `backend="aiter"` for single and batched prefill kernels.
- Utilize Docker images for simplified FlashInfer on ROCm setup.
- Employ paged KV cache for batched LLM serving.
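For readers unfamiliar with paged KV caches, the sketch below shows the bookkeeping idea in plain Python: each request's tokens live in fixed-size pages, and a CSR-style page table (an index pointer array, a flat list of page ids, and the fill level of each request's last page) records which pages belong to which request. The helper names and the sequential page allocation are hypothetical simplifications; a real serving stack hands out whichever pages are free and passes equivalent arrays to the attention kernels.

```python
def build_page_table(seq_lens, page_size):
    """Assign pages to requests (each with at least one token)."""
    indptr = [0]          # indptr[i]:indptr[i+1] slices `indices` for request i
    indices = []          # flat list of page ids, grouped by request
    last_page_len = []    # number of valid tokens in each request's final page
    next_free_page = 0    # simplification: allocate pages sequentially
    for n in seq_lens:
        num_pages = -(-n // page_size)  # ceiling division
        indices.extend(range(next_free_page, next_free_page + num_pages))
        next_free_page += num_pages
        indptr.append(len(indices))
        last_page_len.append(n - (num_pages - 1) * page_size)
    return indptr, indices, last_page_len


def locate_token(indptr, indices, page_size, request, position):
    """Return (page id, slot within page) for a token of a given request."""
    page = indices[indptr[request] + position // page_size]
    return page, position % page_size


indptr, indices, last_page_len = build_page_table([5, 16, 9], page_size=4)
print(indptr, indices, last_page_len)
print(locate_token(indptr, indices, 4, request=2, position=8))
```

Because requests only hold whole pages, memory waste is bounded by one partially filled page per request, and a finished request's pages can be reused immediately by another.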
Topics
- FlashInfer on ROCm
- LLM Inference Serving
- AMD Instinct GPUs
- Prefill Attention Kernels
- AITER Backend
Best for: MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.