Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
Summary
Ragged Paged Attention (RPA) is a high-performance attention kernel for Google's Tensor Processing Units (TPUs), built for dynamic, "ragged" LLM inference workloads in which the sequences in a batch differ in length. Implemented in Pallas and Mosaic, RPA introduces three key techniques: fine-grained tiling for efficient dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and a distribution-aware compilation strategy for specialized kernels. Evaluated with Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. It supports a range of models, TPU generations (v4-v7), and workloads (decode, prefill, and mixed batches). It has been integrated as the primary TPU backend in vLLM and SGLang, where vLLM-TPU token throughput has increased 2x-5x since February 2025.
Key takeaway
For AI engineers and CTOs deploying LLMs on TPUs, RPA addresses the performance bottlenecks of dynamic, ragged inference patterns. Its integration into vLLM and SGLang has come with significant throughput gains, which matters for total cost of ownership (TCO) and for scaling LLM serving. Consider adopting RPA for production-grade TPU inference to maximize hardware utilization and reduce latency, especially for mixed workloads.
Key insights
RPA optimizes LLM inference on TPUs by handling dynamic workloads and fusing KV cache updates for high utilization.
Principles
- Overlap data movement with computation to hide latency.
- Align memory layouts with hardware DMA granularity.
- Specialize kernels for distinct workload patterns.
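The first principle can be illustrated with a toy double-buffering schedule. This is a minimal sketch, not RPA's actual pipeline: it assumes every block takes one time step to fetch and one to compute, and the function names are hypothetical.

```python
def double_buffered_schedule(num_blocks):
    """Return one (fetch_block, compute_block) pair per time step.

    None means that unit is idle on that step. Assumes unit-time fetch and
    compute (an illustrative simplification, not a TPU cost model).
    """
    if num_blocks == 0:
        return []
    schedule = [(0, None)]  # prologue: nothing to compute until block 0 arrives
    for i in range(num_blocks):
        next_block = i + 1 if i + 1 < num_blocks else None
        schedule.append((next_block, i))  # fetch block i+1 while computing block i
    return schedule


def buffer_slot(block):
    """Two buffers alternate, so the block being fetched never overwrites
    the block currently being computed."""
    return block % 2


steps_overlapped = len(double_buffered_schedule(4))
steps_serial = 2 * 4  # fetch then compute, one block at a time
```

Under these assumptions the overlapped schedule needs `num_blocks + 1` steps instead of `2 * num_blocks`, so only the first fetch is exposed; every later fetch is hidden behind computation.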
Method
RPA uses fine-grained tiling, a custom software pipeline fusing KV cache updates with attention, and distribution-aware compilation to optimize LLM inference on TPUs for dynamic, ragged workloads.
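To make the terms concrete, the following pure-Python reference shows what a ragged paged attention computation produces for a single head. It is a slow sketch for checking semantics, not the Pallas kernel: the function name, the argument layout (`cu_q_lens`, `kv_lens`, `page_table`), and the causal-mask convention are assumptions for illustration. It also assumes the new tokens' keys and values have already been written into the cache, whereas RPA fuses that update into the kernel.

```python
import math


def ragged_paged_attention_ref(q, k_pages, v_pages, page_table,
                               cu_q_lens, kv_lens, page_size):
    """Single-head causal attention over a ragged batch with a paged KV cache.

    q          : query vectors for all sequences, concatenated (ragged layout)
    k_pages    : k_pages[page][slot] is one key vector; v_pages likewise
    page_table : page_table[s] lists the physical pages of sequence s, in order
    cu_q_lens  : cumulative query lengths; sequence s owns q[cu[s]:cu[s+1]]
    kv_lens    : total cached tokens per sequence, including the new ones
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for s, kv_len in enumerate(kv_lens):
        q_start, q_end = cu_q_lens[s], cu_q_lens[s + 1]
        q_len = q_end - q_start
        for j in range(q_len):
            query = q[q_start + j]
            # The j-th new token sits at absolute position kv_len - q_len + j
            # and may attend to every position up to and including its own.
            last_visible = kv_len - q_len + j
            scores, values = [], []
            for pos in range(last_visible + 1):
                page = page_table[s][pos // page_size]  # logical -> physical page
                slot = pos % page_size
                key = k_pages[page][slot]
                scores.append(scale * sum(a * b for a, b in zip(query, key)))
                values.append(v_pages[page][slot])
            m = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(x - m) for x in scores]
            total = sum(weights)
            out.append([sum(w * v[i] for w, v in zip(weights, values)) / total
                        for i in range(d)])
    return out


# Mixed batch: sequence 0 decodes one token, sequence 1 prefills two tokens.
page_size = 2
zero = [0.0, 0.0]
k_pages = [[zero, zero], [zero, zero], [zero, zero]]
v_pages = [
    [[3.0, 0.0], [9.0, 9.0]],  # page 0: slot 0 is seq 0 pos 2, slot 1 unused
    [[2.0, 4.0], [4.0, 8.0]],  # page 1: seq 1 pos 0 and 1
    [[0.0, 0.0], [3.0, 6.0]],  # page 2: seq 0 pos 0 and 1
]
page_table = [[2, 0], [1]]     # pages need not be contiguous or ordered
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cu_q_lens = [0, 1, 3]
kv_lens = [3, 2]
out = ragged_paged_attention_ref(q, k_pages, v_pages, page_table,
                                 cu_q_lens, kv_lens, page_size)
```

Because every key in this toy input is zero, the softmax weights are uniform and each output row is the mean of the value vectors that token is allowed to see, which makes the paging and masking logic easy to verify by hand.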
In practice
- Use mini-batching for decode to saturate HBM bandwidth.
- Precompute metadata to overlap scalar and vector execution.
- Tune block sizes offline for specific workload types.
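The last two practices can be sketched together: metadata is computed once on the host before launch, and the batch is classified so that a block size tuned offline for that workload type can be looked up. All names and table values below are hypothetical placeholders, not RPA's real API or tuned numbers.

```python
from itertools import accumulate

# Hypothetical offline-tuned table; the values are placeholders, not measurements.
TUNED_BLOCKS = {
    "decode": {"q_block": 1, "kv_pages_per_block": 8},
    "prefill": {"q_block": 128, "kv_pages_per_block": 4},
    "mixed": {"q_block": 32, "kv_pages_per_block": 4},
}


def classify_batch(q_lens):
    """Decode: every sequence adds one token. Prefill: every sequence adds several."""
    if all(n == 1 for n in q_lens):
        return "decode"
    if all(n > 1 for n in q_lens):
        return "prefill"
    return "mixed"


def build_metadata(q_lens, kv_lens, page_size):
    """Precompute per-sequence offsets and page counts once per batch."""
    cu_q_lens = [0] + list(accumulate(q_lens))
    pages_per_seq = [-(-n // page_size) for n in kv_lens]  # ceiling division
    return {
        "cu_q_lens": cu_q_lens,
        "pages_per_seq": pages_per_seq,
        "kind": classify_batch(q_lens),
    }


meta = build_metadata(q_lens=[1, 5, 1], kv_lens=[9, 5, 17], page_size=4)
blocks = TUNED_BLOCKS[meta["kind"]]
```

Keeping the classification and lookup outside the kernel means the kernel itself can be compiled for a fixed block configuration, which is the idea behind specializing kernels per workload type.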
Topics
- Ragged Paged Attention
- TPU Inference
- LLM Inference Kernels
- KV Cache Optimization
- Pallas/Mosaic
Best for: AI Engineer, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.