Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
Summary
Ragged Paged Attention (RPA) is a high-performance and flexible attention kernel designed for Large Language Model (LLM) inference on Google's Tensor Processing Units (TPUs). Developed using Pallas and Mosaic, RPA addresses the dynamic and ragged execution patterns prevalent in modern LLM serving, which existing GPU-centric kernels do not support well on TPU. RPA employs three core techniques: fine-grained tiling for efficient dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and a distribution-aware compilation strategy that creates specialized kernels for decode, prefill, and mixed workloads. Evaluated with Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. It is integrated into vLLM and SGLang as a production-grade TPU backend.
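The two headline metrics are ratios of achieved throughput to hardware peak. The sketch below shows that arithmetic using the standard definitions of MBU and MFU. The function names are hypothetical, and every number in the example is a placeholder chosen for easy arithmetic, not a TPU7x specification or a measurement from the article.

```python
def utilization(achieved_per_second: float, peak_per_second: float) -> float:
    """Return achieved throughput as a fraction of hardware peak."""
    if peak_per_second <= 0:
        raise ValueError("peak_per_second must be positive")
    return achieved_per_second / peak_per_second


def memory_bandwidth_utilization(bytes_moved: float, seconds: float,
                                 peak_bytes_per_second: float) -> float:
    """MBU: bytes actually moved per second divided by peak memory bandwidth."""
    return utilization(bytes_moved / seconds, peak_bytes_per_second)


def model_flops_utilization(model_flops: float, seconds: float,
                            peak_flops_per_second: float) -> float:
    """MFU: model FLOPs executed per second divided by peak compute rate."""
    return utilization(model_flops / seconds, peak_flops_per_second)


# Placeholder numbers only (assumptions, not hardware specs or results).
example_mbu = memory_bandwidth_utilization(
    bytes_moved=2e11, seconds=0.25, peak_bytes_per_second=1e12)
example_mfu = model_flops_utilization(
    model_flops=3e14, seconds=0.5, peak_flops_per_second=1e15)
```

Decode is typically reported with MBU because each step reads the weights and KV cache to produce one token per sequence, while prefill is typically reported with MFU because it processes many tokens per sequence and is bound by compute.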
Key takeaway
For AI Architects and MLOps Engineers deploying LLMs on TPUs, RPA is a way to improve inference performance and, through better hardware utilization, total cost of ownership (TCO). Its kernel design for dynamic workloads, including fused KV cache updates and distribution-aware compilation, directly addresses the limitations of GPU-centric approaches. Consider adopting RPA through vLLM or SGLang to improve memory bandwidth and FLOPs utilization in TPU-based LLM deployments; the reported results (up to 86% MBU in decode and 73% MFU in prefill) are for Llama 3 8B on TPU7x.
Key insights
Ragged Paged Attention optimizes LLM inference on TPUs for dynamic workloads via fine-grained tiling and fused operations.
Principles
- Fuse KV cache updates with attention.
- Specialize kernels for workload types.
- Utilize fine-grained tiling for ragged memory.
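To make the principles above concrete, here is a minimal pure-Python reference for attention over a paged KV cache with a ragged batch, where each sequence contributes a different number of new tokens. It is a sequential sketch for reading, not the Pallas kernel: it does not reproduce RPA's tiling or software pipelining, and "fused" here only means the cache write and the attention happen in the same pass. `PAGE_SIZE`, the dictionary layout, and all function names are assumptions made for illustration.

```python
import math

PAGE_SIZE = 2  # tokens per KV page (illustrative value, an assumption)


def make_pages(num_pages):
    """Allocate empty KV pages; each page holds PAGE_SIZE token slots."""
    return [[None] * PAGE_SIZE for _ in range(num_pages)]


def write_kv(pages, page_table, pos, vec):
    """Store vec at logical token position pos via the page table."""
    pages[page_table[pos // PAGE_SIZE]][pos % PAGE_SIZE] = vec


def read_kv(pages, page_table, pos):
    """Load the vector stored at logical token position pos."""
    return pages[page_table[pos // PAGE_SIZE]][pos % PAGE_SIZE]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def ragged_paged_attention(seqs, k_pages, v_pages):
    """Write each sequence's new K/V into its pages, then attend causally."""
    outputs = []
    for seq in seqs:
        table, start = seq["page_table"], seq["cached_len"]
        # Cache update in the same pass as attention (the "fused" step).
        for i, (k, v) in enumerate(zip(seq["k"], seq["v"])):
            write_kv(k_pages, table, start + i, k)
            write_kv(v_pages, table, start + i, v)
        seq_out = []
        for i, q in enumerate(seq["q"]):
            visible = start + i + 1  # causal mask: tokens up to and including self
            keys = [read_kv(k_pages, table, p) for p in range(visible)]
            vals = [read_kv(v_pages, table, p) for p in range(visible)]
            seq_out.append(attend(q, keys, vals))
        outputs.append(seq_out)
    return outputs


# Ragged batch: one decode sequence (1 new token) and one prefill (3 new tokens).
k_pages, v_pages = make_pages(4), make_pages(4)
decode_seq = {"page_table": [2, 0], "cached_len": 3,
              "q": [[1.0, 0.0]], "k": [[0.5, 0.5]], "v": [[4.0, 4.0]]}
for pos in range(3):  # pretend three tokens were cached by earlier steps
    write_kv(k_pages, decode_seq["page_table"], pos, [float(pos), 1.0])
    write_kv(v_pages, decode_seq["page_table"], pos, [float(pos), 0.0])
prefill_seq = {"page_table": [3, 1], "cached_len": 0,
               "q": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
               "k": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
               "v": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}
outputs = ragged_paged_attention([decode_seq, prefill_seq], k_pages, v_pages)
```

The page tables are deliberately non-contiguous (`[2, 0]` and `[3, 1]`) to show that a sequence's logical token order is independent of where its pages sit in memory, which is the access pattern that fine-grained tiling is meant to handle efficiently.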
Method
RPA uses fine-grained tiling for dynamic slicing, a software pipeline fusing KV cache updates with attention, and distribution-aware compilation to generate specialized kernels for decode, prefill, and mixed LLM workloads on TPUs.
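The idea of selecting a specialized kernel per workload type can be sketched as a small dispatcher. The source does not say how RPA classifies a batch, so the rule below (decode when every sequence has exactly one new token, prefill when every sequence has more than one, mixed otherwise) is an assumption, and the tile sizes in `KERNEL_CONFIGS` are invented placeholders rather than RPA's actual settings.

```python
def classify_batch(query_lens):
    """Label a batch by the distribution of new-token counts per sequence."""
    if not query_lens:
        raise ValueError("batch must contain at least one sequence")
    if any(n < 1 for n in query_lens):
        raise ValueError("each sequence needs at least one new token")
    if all(n == 1 for n in query_lens):
        return "decode"
    if all(n > 1 for n in query_lens):
        return "prefill"
    return "mixed"


# Hypothetical per-workload tile sizes (placeholders, not values from RPA).
KERNEL_CONFIGS = {
    "decode": {"q_tile": 1, "kv_tile": 512},
    "prefill": {"q_tile": 128, "kv_tile": 128},
    "mixed": {"q_tile": 32, "kv_tile": 256},
}


def select_kernel(query_lens):
    """Return the workload label and the config a specialized kernel would use."""
    kind = classify_batch(query_lens)
    return kind, KERNEL_CONFIGS[kind]


kind, config = select_kernel([1, 256, 1])
```

Compiling one variant per workload type lets each variant fix its tile shapes ahead of time instead of paying for worst-case generality on every batch.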
In practice
- Deploy RPA through its vLLM or SGLang integration; the reported results are for Llama 3 8B on TPU7x.
- Use Pallas and Mosaic for kernel development.
- Optimize for ragged memory patterns.
Topics
- Ragged Paged Attention
- LLM Inference
- Tensor Processing Units
- KV Cache
- Pallas and Mosaic
Best for: AI Architect, MLOps Engineer, CTO, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.