Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Expert, extended

Summary

Ragged Paged Attention (RPA) is a high-performance attention kernel designed for Google's Tensor Processing Units (TPUs), specifically addressing the challenges of dynamic and "ragged" LLM inference workloads. Implemented using Pallas and Mosaic, RPA introduces three key techniques: fine-grained tiling for efficient dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and a distribution-aware compilation strategy for specialized kernels. Evaluated on Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. It supports various models, TPU generations (v4-v7), and workloads (decode, prefill, mixed batches), and has been integrated as the primary TPU backend in vLLM and SGLang, demonstrating a 2x-5x increase in token throughput for vLLM-TPU since February 2025.
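The two utilization figures measure different limits, which is why decode and prefill are reported separately: decode is typically memory-bound and prefill compute-bound. As a reminder, the standard definitions can be written as follows. The symbols are illustrative and not taken from the paper: B_moved is the number of bytes read and written, F_model is the model's theoretical FLOP count, and t is the elapsed time.

```latex
\mathrm{MBU} = \frac{B_{\text{moved}} / t}{\mathrm{BW}_{\text{peak}}},
\qquad
\mathrm{MFU} = \frac{F_{\text{model}} / t}{\mathrm{FLOPS}_{\text{peak}}}
```

Under these definitions, 86% MBU means the decode kernel sustains 86% of the chip's peak memory bandwidth, and 73% MFU means prefill sustains 73% of its peak arithmetic throughput.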

Key takeaway

For AI engineers and CTOs deploying LLMs on TPUs, RPA targets the performance bottlenecks caused by dynamic, ragged inference patterns. It is already the primary TPU backend in vLLM and SGLang, and vLLM-TPU token throughput has risen 2x-5x since February 2025, which bears directly on total cost of ownership (TCO) when scaling LLM serving. Consider adopting RPA for production-grade TPU inference to maximize hardware utilization and reduce latency, especially for mixed workloads.

Key insights

RPA optimizes LLM inference on TPUs by handling dynamic workloads and fusing KV cache updates for high utilization.

Method

RPA uses fine-grained tiling, a custom software pipeline fusing KV cache updates with attention, and distribution-aware compilation to optimize LLM inference on TPUs for dynamic, ragged workloads.
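To make the workload concrete, here is a minimal NumPy reference for what a ragged, paged attention step computes on a mixed batch. It is a sketch of the semantics only, not the RPA kernel: it is not written in Pallas, and it shows neither the tiling nor the software pipeline. The function name, argument names, and PAGE_SIZE are hypothetical, and a single attention head is assumed.

```python
import numpy as np

PAGE_SIZE = 4  # tokens per KV page; illustrative value, not from the paper


def ragged_paged_attention_ref(q, new_k, new_v, k_pages, v_pages,
                               page_table, kv_lens, cu_q_lens):
    """Hypothetical single-head reference, not the RPA API.

    q, new_k, new_v: [total_new_tokens, head_dim], sequences packed back to back.
    k_pages, v_pages: [num_pages, PAGE_SIZE, head_dim], updated in place.
    page_table[i]: page ids of sequence i; kv_lens[i]: its length incl. new tokens.
    cu_q_lens: cumulative new-token counts, length num_seqs + 1.
    """
    head_dim = q.shape[-1]
    out = np.empty_like(q)
    for i, kv_len in enumerate(kv_lens):
        start, end = cu_q_lens[i], cu_q_lens[i + 1]
        q_len = end - start  # 1 for decode, many for prefill: the "ragged" part
        # Step 1: write the new K/V tokens into their pages
        # (the update that RPA fuses into the attention kernel).
        for t in range(q_len):
            pos = kv_len - q_len + t
            page, slot = page_table[i][pos // PAGE_SIZE], pos % PAGE_SIZE
            k_pages[page, slot] = new_k[start + t]
            v_pages[page, slot] = new_v[start + t]
        # Step 2: gather this sequence's KV through its page table.
        pages = page_table[i][: -(-kv_len // PAGE_SIZE)]  # ceil division
        k = k_pages[pages].reshape(-1, head_dim)[:kv_len]
        v = v_pages[pages].reshape(-1, head_dim)[:kv_len]
        # Step 3: causal softmax attention; query t sits at position kv_len - q_len + t.
        scores = (q[start:end] @ k.T) / np.sqrt(head_dim)
        q_pos = np.arange(kv_len - q_len, kv_len)[:, None]
        scores = np.where(np.arange(kv_len)[None, :] <= q_pos, scores, -np.inf)
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        out[start:end] = (probs / probs.sum(axis=-1, keepdims=True)) @ v
    return out


# Mixed batch: sequence 0 decodes 1 token after 5 cached ones,
# sequence 1 prefills 5 tokens from an empty cache.
rng = np.random.default_rng(0)
head_dim = 8
k_pages = np.zeros((4, PAGE_SIZE, head_dim))
v_pages = np.zeros((4, PAGE_SIZE, head_dim))
page_table = [[2, 0], [3, 1]]  # pages of one sequence need not be contiguous
hist_k, hist_v = rng.normal(size=(2, 5, head_dim))
for pos in range(5):  # pre-existing cache of sequence 0
    page, slot = page_table[0][pos // PAGE_SIZE], pos % PAGE_SIZE
    k_pages[page, slot], v_pages[page, slot] = hist_k[pos], hist_v[pos]
q, new_k, new_v = rng.normal(size=(3, 6, head_dim))
out = ragged_paged_attention_ref(q, new_k, new_v, k_pages, v_pages,
                                 page_table, kv_lens=[6, 5], cu_q_lens=[0, 1, 6])
```

The per-sequence Python loop and the full gather in step 2 are the costs a production kernel avoids. Per the summary above, RPA instead tiles the ragged ranges finely, overlaps the cache write with the attention computation, and compiles specialized kernels for the expected mix of decode and prefill.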

Best for: AI Engineer, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.