FlashInfer on ROCm: High‑Throughput Prefill Attention via AITER

Source: AMD ROCm Blogs · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

FlashInfer on ROCm is a high-performance kernel library that optimizes attention computation for large language model (LLM) inference on AMD Instinct GPUs. The release of April 6, 2026 updates it from version 0.2.5 to 0.5.3 and introduces FlashAttention-2-based prefill kernels in single-request, batched, and ragged variants for AMD's CDNA3 and CDNA4 architectures. These complement the previously ported decode kernels. The library supports Paged KV-Cache for efficient memory management, as well as Grouped Query Attention (GQA) and Multi-Query Attention (MQA), which reduce KV cache requirements by sharing key/value heads across query heads. The port required significant architectural changes: NVIDIA's warp matrix operations were replaced with CDNA3/CDNA4 Matrix Fused Multiply-Add (MFMA) instructions, and thread layouts were restructured around 64-thread wavefronts.
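To make the KV cache reduction from GQA and MQA concrete, the following is a minimal back-of-the-envelope sketch. It is not part of FlashInfer: the function name `kv_cache_bytes` and the model dimensions (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements) are hypothetical values chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V tensor per layer.

    Hypothetical helper for illustration; not a FlashInfer API.
    """
    # Factor 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed model shape: 32 layers, 32 query heads, head_dim 128, 4096 tokens.
mha = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=4096)  # one KV head per query head
gqa = kv_cache_bytes(32, num_kv_heads=8, head_dim=128, seq_len=4096)   # 4 query heads share a KV head
mqa = kv_cache_bytes(32, num_kv_heads=1, head_dim=128, seq_len=4096)   # all query heads share one KV head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**30:.4f} GiB")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads, so the GQA configuration needs a quarter of the memory of full multi-head attention and the MQA configuration needs 1/32 of it.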

Key takeaway

For MLOps engineers deploying LLMs on AMD Instinct GPUs, FlashInfer on ROCm provides prefill and decode attention kernels optimized for CDNA3 and CDNA4. Integrate the library to improve inference throughput and memory utilization, especially for models that use GQA or MQA. Consider the provided Docker images for a streamlined setup, and explore the AITER backend for specific prefill operations.

Key insights

FlashInfer on ROCm optimizes LLM inference on AMD GPUs by specializing attention kernels for prefill and decode phases.

Method

The port restructured four core computational stages: loading query matrices, streaming key/value data, computing query-key dot products, and performing the softmax-value multiplication. As part of this restructuring, NVIDIA's wmma operations were replaced with CDNA3/CDNA4 MFMA instructions.
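The four stages can be illustrated with a small CPU-side sketch of tiled attention using an online softmax, the general technique behind FlashAttention-2. This is an illustrative assumption about the data flow, not the ROCm kernel: the function names, tile sizes, and pure-Python list arithmetic are all hypothetical, causal masking is omitted for brevity, and on the GPU the dot products in stages 3 and 4 would be carried out by matrix instructions rather than scalar loops.

```python
import math


def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, computed one query row at a time."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, q_tile=2, kv_tile=2):
    """Tiled attention with online softmax (hypothetical sketch, no masking)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for q_start in range(0, len(q), q_tile):
        # Stage 1: load one tile of query rows.
        q_blk = q[q_start:q_start + q_tile]
        row_max = [-math.inf] * len(q_blk)   # running max of scores per row
        row_sum = [0.0] * len(q_blk)         # running softmax denominator
        acc = [[0.0] * dv for _ in q_blk]    # unnormalized output accumulator
        for kv_start in range(0, len(k), kv_tile):
            # Stage 2: stream the next tile of keys and values.
            k_blk = k[kv_start:kv_start + kv_tile]
            v_blk = v[kv_start:kv_start + kv_tile]
            for i, qi in enumerate(q_blk):
                # Stage 3: query-key dot products for this tile.
                scores = [scale * sum(a * b for a, b in zip(qi, kj))
                          for kj in k_blk]
                # Stage 4: softmax-value multiply, rescaling earlier partial
                # results whenever the running maximum grows.
                new_max = max(row_max[i], max(scores))
                correction = math.exp(row_max[i] - new_max)
                probs = [math.exp(s - new_max) for s in scores]
                row_sum[i] = row_sum[i] * correction + sum(probs)
                acc[i] = [a * correction
                          + sum(p * vj[c] for p, vj in zip(probs, v_blk))
                          for c, a in enumerate(acc[i])]
                row_max[i] = new_max
        # Normalize once per query tile, after all key/value tiles are seen.
        out.extend([[a / row_sum[i] for a in acc[i]]
                    for i in range(len(q_blk))])
    return out
```

Because the running maximum and denominator are corrected as each key/value tile arrives, the full score matrix never has to be materialized, and the tiled result matches the naive reference up to floating-point rounding.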

Best for: MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.