Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Expert, quick

Summary

Ragged Paged Attention (RPA) is a high-performance, flexible attention kernel for Large Language Model (LLM) inference on Google's Tensor Processing Units (TPUs), written with Pallas and Mosaic. It targets the dynamic, ragged execution patterns common in modern LLM serving, which existing GPU-centric kernels do not support well. RPA combines three core techniques: fine-grained tiling for efficient dynamic slicing over ragged memory, a custom software pipeline that fuses KV cache updates with attention computation, and a distribution-aware compilation strategy that generates specialized kernels for decode, prefill, and mixed workloads. Evaluated with Llama 3 8B on TPU7x, RPA reaches up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. It is integrated into vLLM and SGLang as a production-grade TPU backend.
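To make the two reported metrics concrete, here is a minimal sketch of how MBU and MFU are commonly defined: achieved throughput divided by the hardware peak. The function names and the toy numbers (arbitrary units) are assumptions for illustration only; they are not TPU7x specifications or measurements from the original article.

```python
def memory_bandwidth_utilization(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of peak memory bandwidth actually achieved (MBU)."""
    if seconds <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("seconds and peak_bytes_per_s must be positive")
    return (bytes_moved / seconds) / peak_bytes_per_s


def model_flops_utilization(model_flops, seconds, peak_flops_per_s):
    """Fraction of peak compute actually spent on model FLOPs (MFU)."""
    if seconds <= 0 or peak_flops_per_s <= 0:
        raise ValueError("seconds and peak_flops_per_s must be positive")
    return (model_flops / seconds) / peak_flops_per_s


# Toy numbers in arbitrary units, chosen only to illustrate the ratios.
mbu = memory_bandwidth_utilization(bytes_moved=860.0, seconds=1.0,
                                   peak_bytes_per_s=1000.0)
mfu = model_flops_utilization(model_flops=730.0, seconds=1.0,
                              peak_flops_per_s=1000.0)
print(f"MBU = {mbu:.0%}, MFU = {mfu:.0%}")
```

Decode is typically limited by how fast the KV cache and weights can be read, so MBU is the natural metric there; prefill is typically compute-bound, so MFU is the natural metric.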

Key takeaway

For AI architects and MLOps engineers deploying LLMs on TPUs, RPA is a way to optimize inference performance and total cost of ownership (TCO). Its kernel design for dynamic workloads, including fused KV cache updates and distribution-aware compilation, addresses the limitations of GPU-centric approaches. Consider adopting RPA through vLLM or SGLang to improve memory bandwidth and FLOPs utilization in TPU-based LLM deployments; the reported results are for Llama 3 8B on TPU7x.

Key insights

Ragged Paged Attention optimizes LLM inference on TPUs for dynamic workloads via fine-grained tiling and fused operations.
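The terms "ragged" and "paged" can be illustrated with a small, unoptimized reference in plain Python. This is a sketch under stated assumptions, not the RPA kernel: the function name, argument names, and data layout (a page table per sequence, cumulative query offsets) are hypothetical, and it uses a single attention head and no tiling, pipelining, or fusion.

```python
import math


def ragged_paged_attention_ref(queries, k_pages, v_pages, page_table,
                               cu_q_lens, kv_lens, page_size):
    """Single-head causal attention over a ragged batch and a paged KV cache.

    queries:    query vectors of all sequences, concatenated (ragged batch).
    k_pages:    k_pages[p][s] is the key vector in slot s of physical page p.
    v_pages:    same layout as k_pages, holding value vectors.
    page_table: page_table[i] lists the physical pages of sequence i, in order.
    cu_q_lens:  sequence i owns queries[cu_q_lens[i]:cu_q_lens[i + 1]].
    kv_lens:    cached tokens per sequence, including the new query tokens.
    """
    outputs = []
    for seq, kv_len in enumerate(kv_lens):
        q_start, q_end = cu_q_lens[seq], cu_q_lens[seq + 1]
        q_len = q_end - q_start
        for j in range(q_len):
            q = queries[q_start + j]
            scale = 1.0 / math.sqrt(len(q))
            # Causal mask: the j-th new token sees the cached prefix plus
            # the new tokens up to and including itself.
            visible = kv_len - q_len + j + 1
            scores, values = [], []
            for pos in range(visible):
                # Translate a logical position into (physical page, slot).
                page = page_table[seq][pos // page_size]
                slot = pos % page_size
                k = k_pages[page][slot]
                scores.append(scale * sum(a * b for a, b in zip(q, k)))
                values.append(v_pages[page][slot])
            # Numerically stable softmax over the visible positions.
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            dim = len(values[0])
            outputs.append([
                sum(w * v[d] for w, v in zip(weights, values)) / total
                for d in range(dim)
            ])
    return outputs


# Toy batch: sequence 0 is a decode step (1 query, 3 cached tokens),
# sequence 1 is a prefill (2 queries, 2 cached tokens). Pages are
# deliberately non-contiguous to show the page-table indirection.
k_pages = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
v_pages = [[[10.0], [20.0]], [[3.0], [0.0]], [[1.0], [2.0]]]
result = ragged_paged_attention_ref(
    queries=[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    k_pages=k_pages, v_pages=v_pages,
    page_table=[[2, 1], [0]],
    cu_q_lens=[0, 1, 3], kv_lens=[3, 2], page_size=2,
)
```

Because every key in the toy batch is zero, each output is simply the mean of the values visible to that query, which makes the paging and causal-mask logic easy to check by hand.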

Method

RPA uses fine-grained tiling for dynamic slicing, a software pipeline fusing KV cache updates with attention, and distribution-aware compilation to generate specialized kernels for decode, prefill, and mixed LLM workloads on TPUs.
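The idea of selecting a specialized kernel per workload can be sketched as a simple classifier over the query lengths in a batch. Everything here is an assumption for illustration: the classification rule, the `KernelVariant` type, and the tile sizes are hypothetical placeholders, not the criteria or tuned settings used by RPA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelVariant:
    name: str
    q_tile: int   # hypothetical: query tokens processed per tile
    kv_tile: int  # hypothetical: KV pages processed per tile


# Placeholder settings, not the tuned values of the real kernel.
VARIANTS = {
    "decode": KernelVariant("decode", q_tile=1, kv_tile=8),
    "prefill": KernelVariant("prefill", q_tile=128, kv_tile=4),
    "mixed": KernelVariant("mixed", q_tile=32, kv_tile=4),
}


def classify_workload(q_lens):
    """Label a batch by the distribution of its per-sequence query lengths."""
    if not q_lens:
        raise ValueError("batch must contain at least one sequence")
    if all(n == 1 for n in q_lens):
        return "decode"   # every sequence generates exactly one token
    if all(n > 1 for n in q_lens):
        return "prefill"  # every sequence processes a multi-token prompt
    return "mixed"        # decode and prefill sequences share the batch


def select_variant(q_lens):
    """Pick the kernel variant specialized for this batch's workload."""
    return VARIANTS[classify_workload(q_lens)]


print(select_variant([1, 1, 1]).name)
print(select_variant([512, 64]).name)
print(select_variant([1, 256, 1]).name)
```

Specializing per workload lets each variant choose tile shapes suited to its bottleneck: memory traffic for decode, compute for prefill.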

Best for: AI Architect, MLOps Engineer, CTO, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.