Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

2025-10-29 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

Ragged Paged Attention (RPA) is a high-performance attention kernel designed for Google's Tensor Processing Units (TPUs), specifically addressing the challenges of dynamic and "ragged" LLM inference workloads. Implemented using Pallas and Mosaic, RPA introduces three key techniques: fine-grained tiling for efficient dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and a distribution-aware compilation strategy for specialized kernels. Evaluated on Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. It supports various models, TPU generations (v4-v7), and workloads (decode, prefill, mixed batches), and has been integrated as the primary TPU backend in vLLM and SGLang, demonstrating a 2x-5x increase in token throughput for vLLM-TPU since February 2025.
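
The two utilization metrics quoted above are ratios of achieved to peak throughput. A minimal sketch of how they are computed, assuming placeholder peak figures and made-up measurements (none of the numbers below are real TPU specifications or results from the paper):

```python
def mbu(bytes_moved: float, seconds: float, peak_bytes_per_s: float) -> float:
    """Memory bandwidth utilization: achieved HBM bandwidth over peak."""
    return (bytes_moved / seconds) / peak_bytes_per_s


def mfu(model_flops: float, seconds: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over peak FLOP/s."""
    return (model_flops / seconds) / peak_flops_per_s


# Placeholder hardware peaks, chosen only to keep the arithmetic readable.
PEAK_BW = 1.0e12      # bytes per second (assumption)
PEAK_FLOPS = 1.0e15   # FLOP per second (assumption)

# Made-up measurements for one decode step and one prefill step.
decode_mbu = mbu(bytes_moved=8.0e9, seconds=0.01, peak_bytes_per_s=PEAK_BW)
prefill_mfu = mfu(model_flops=5.0e12, seconds=0.01, peak_flops_per_s=PEAK_FLOPS)

print(f"decode MBU = {decode_mbu:.2f}, prefill MFU = {prefill_mfu:.2f}")
```

Decode is typically judged by MBU because each step streams the KV cache from memory for very little arithmetic, while prefill is judged by MFU because it is dominated by matrix multiplications.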

Key takeaway

For AI engineers and CTOs deploying LLMs on TPUs, RPA addresses the performance bottlenecks caused by dynamic, ragged inference patterns. It is already the primary TPU backend in vLLM and SGLang, and vLLM-TPU token throughput has risen 2x-5x since February 2025, which bears directly on TCO and serving scale. Consider adopting RPA for production-grade TPU inference to raise hardware utilization and reduce latency, especially for mixed workloads.

Key insights

RPA optimizes LLM inference on TPUs by handling dynamic workloads and fusing KV cache updates for high utilization.

Principles

Overlap data movement with computation to hide latency.
Align memory layouts with hardware DMA granularity.
Specialize kernels for distinct workload patterns.
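
The first principle can be illustrated with a toy timing model. This is a sketch under simplifying assumptions (two buffers, one block in flight, each step waits for both the current compute and the next load); it is not a model of the actual RPA pipeline, and the function names are hypothetical:

```python
def serial_time(load, compute):
    """No overlap: every block is loaded, then computed, one after another."""
    return sum(load) + sum(compute)


def pipelined_time(load, compute):
    """Double-buffered schedule: the load of block i+1 overlaps compute of block i."""
    assert len(load) == len(compute) and load
    total = load[0]  # the first load has nothing to hide behind
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        # The step finishes when both the compute and the prefetch are done.
        total += max(compute[i], next_load)
    return total


# Compute-bound case: loads are fully hidden after the first block.
print(serial_time([2, 2, 2, 2], [3, 3, 3, 3]))     # 20
print(pipelined_time([2, 2, 2, 2], [3, 3, 3, 3]))  # 14

# Memory-bound case: compute is hidden behind the loads instead.
print(pipelined_time([3, 3, 3, 3], [1, 1, 1, 1]))  # 13
```

In the memory-bound case the pipelined time approaches the pure load time, which is the regime where a decode kernel can approach peak memory bandwidth.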

Method

RPA uses fine-grained tiling, a custom software pipeline fusing KV cache updates with attention, and distribution-aware compilation to optimize LLM inference on TPUs for dynamic, ragged workloads.
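
To make the terms concrete, here is a slow pure-Python reference for what a ragged, paged attention step computes: each sequence contributes a different number of new tokens, its KV cache lives in non-contiguous pages, and the new keys and values are written before attention runs. It is a semantic sketch only. The data layout, the function names, and the single-head simplification are assumptions for illustration, and nothing here reflects how the Pallas kernel tiles or pipelines the work.

```python
import math

PAGE_SIZE = 2  # tokens per KV page (illustrative)


def write_kv(k_pages, v_pages, page_ids, pos, k, v):
    """Store one token's key/value at logical position `pos` of a sequence."""
    page, slot = page_ids[pos // PAGE_SIZE], pos % PAGE_SIZE
    k_pages[page][slot] = k
    v_pages[page][slot] = v


def read_kv(pages, page_ids, pos):
    """Fetch the vector at logical position `pos` through the page table."""
    return pages[page_ids[pos // PAGE_SIZE]][pos % PAGE_SIZE]


def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]


def ragged_paged_attention(seqs, k_pages, v_pages):
    """seqs: dicts with page_ids, cached_len, and per-new-token q/k/v lists."""
    outputs = []
    for s in seqs:
        # 1) Append the new tokens' K/V to the paged cache.
        for j, (k, v) in enumerate(zip(s["k"], s["v"])):
            write_kv(k_pages, v_pages, s["page_ids"], s["cached_len"] + j, k, v)
        # 2) Causal attention: new token j sees positions 0..cached_len + j.
        for j, q in enumerate(s["q"]):
            n = s["cached_len"] + j + 1
            keys = [read_kv(k_pages, s["page_ids"], p) for p in range(n)]
            vals = [read_kv(v_pages, s["page_ids"], p) for p in range(n)]
            outputs.append(attend(q, keys, vals))
    return outputs


k_pages = [[None] * PAGE_SIZE for _ in range(4)]
v_pages = [[None] * PAGE_SIZE for _ in range(4)]
# Sequence A already has 3 cached tokens spread over pages 2 and 0.
for pos in range(3):
    write_kv(k_pages, v_pages, [2, 0], pos, [1.0, 0.0], [float(pos), 0.0])

seqs = [
    # A: decode step, one new token.
    {"page_ids": [2, 0], "cached_len": 3,
     "q": [[0.0, 0.0]], "k": [[1.0, 0.0]], "v": [[3.0, 0.0]]},
    # B: prefill, two new tokens and an empty cache.
    {"page_ids": [3], "cached_len": 0,
     "q": [[1.0, 0.0], [1.0, 0.0]],
     "k": [[0.0, 0.0], [0.0, 0.0]], "v": [[2.0, 4.0], [4.0, 8.0]]},
]
out = ragged_paged_attention(seqs, k_pages, v_pages)
print(out)  # one output row per new token across the ragged batch
```

The batch is "ragged" because sequence A contributes one query row and sequence B contributes two, so the output has three rows rather than a padded rectangular shape.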

In practice

Use mini-batching for decode to saturate HBM bandwidth.
Precompute metadata to overlap scalar and vector execution.
Tune block sizes offline for specific workload types.
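
The last two practices can be sketched on the host side: compute per-batch metadata once, classify the batch, and look up block sizes from a table produced by an offline sweep. Everything below is hypothetical, including the table contents, the field names, and the classification rule; it is not the interface of vllm-project/tpu-inference.

```python
from itertools import accumulate

# Hypothetical result of an offline tuning sweep; the values are placeholders.
TUNED_BLOCKS = {
    "decode": {"q_block": 1, "kv_pages_per_block": 16},
    "prefill": {"q_block": 128, "kv_pages_per_block": 8},
    "mixed": {"q_block": 32, "kv_pages_per_block": 8},
}


def classify(q_lens):
    """Label a batch by how many new tokens each sequence contributes."""
    if all(n == 1 for n in q_lens):
        return "decode"
    if all(n > 1 for n in q_lens):
        return "prefill"
    return "mixed"


def build_metadata(q_lens, kv_lens, page_size):
    """Precompute the scalars a kernel would otherwise derive per block."""
    kind = classify(q_lens)
    return {
        "kind": kind,
        # Start offset of each sequence's rows in the flattened query array.
        "cu_q_lens": [0] + list(accumulate(q_lens)),
        # Number of KV pages each sequence occupies (ceiling division).
        "pages_per_seq": [-(-n // page_size) for n in kv_lens],
        "blocks": TUNED_BLOCKS[kind],
    }


meta = build_metadata(q_lens=[1, 5, 1], kv_lens=[17, 5, 32], page_size=16)
print(meta)
```

Selecting a variant per batch type is one simple way to specialize kernels for distinct workload patterns without retuning at serving time.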

Topics

Ragged Paged Attention
TPU Inference
LLM Inference Kernels
KV Cache Optimization
Pallas/Mosaic

Code references

vllm-project/tpu-inference

Best for: AI Engineer, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.