PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

PersistentKV is a native block-table decode attention engine that targets the key-value (KV) cache movement bottleneck in long-context large language model (LLM) serving on commodity GPUs. Designed for grouped-query attention (GQA), it reuses K and V tiles across the query heads in each group, supports native page tables, and uses a compact workqueue schedule that executes only non-empty tasks. In benchmarks on an RTX 3060 (FP16, page size 16, Hq=32, Hkv=8, d=128), a calibrated adaptive policy that selects between FlashInfer and PersistentKV improved synchronized wall throughput by 1.063-1.265x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.399x on a B1 bucketed trace. On B4 bimodal workloads, the policy avoided a regression by choosing FlashInfer.
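To make the KV cache movement bottleneck concrete, the following back-of-the-envelope sketch estimates how many KV bytes one decode step reads per layer for the benchmark shape (FP16, Hq=32, Hkv=8, d=128). The function names are hypothetical, and the "no reuse" case is an illustrative worst case in which every query head loads its own copy of K and V; it is not a description of how FlashInfer or any other kernel behaves.

```python
def kv_bytes_per_token(num_kv_heads: int = 8, head_dim: int = 128,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of K plus V stored per token, per layer (FP16 = 2 bytes)."""
    return 2 * num_kv_heads * head_dim * bytes_per_elem


def decode_kv_traffic(seq_len: int, num_q_heads: int = 32, num_kv_heads: int = 8,
                      head_dim: int = 128, bytes_per_elem: int = 2,
                      reuse_across_group: bool = True) -> int:
    """KV bytes read for one decode step of one sequence, per layer.

    With reuse, each K,V tile is loaded once per KV head and shared by the
    Hq/Hkv query heads in its group. Without reuse (hypothetical worst case),
    each query head loads K,V separately.
    """
    heads_loading = num_kv_heads if reuse_across_group else num_q_heads
    return 2 * heads_loading * head_dim * bytes_per_elem * seq_len


if __name__ == "__main__":
    n = 32_768  # example context length, chosen for illustration
    shared = decode_kv_traffic(n)
    naive = decode_kv_traffic(n, reuse_across_group=False)
    print(kv_bytes_per_token(), shared, naive, naive // shared)
```

Under these assumptions each cached token costs 4096 bytes per layer, and sharing tiles across a group of Hq/Hkv = 4 query heads cuts the worst-case KV reads by a factor of 4.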

Key takeaway

AI engineers deploying long-context LLMs on commodity GPUs should consider adaptive, page-aware decode scheduling to reduce the KV cache movement bottleneck. In the reported RTX 3060 benchmarks, a calibrated policy that selects between FlashInfer and PersistentKV improved synchronized wall throughput by 1.063-1.265x on B8 workloads and by 1.399x on a B1 bucketed trace. Selecting the kernel per workload also guards against regressions: on B4 bimodal workloads the policy chose FlashInfer rather than PersistentKV.
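One way such a calibrated policy could be structured is sketched below. The summary does not describe how the actual policy is calibrated, so everything here is an assumption: the bucket key (batch size, workload shape), the `min_gain` threshold, the function names, and in particular the timing values, which are made-up placeholders and not measurements from the paper. The sketch also assumes that B8, B4, and B1 denote batch sizes.

```python
from typing import Dict, Tuple

# Hypothetical bucket key: (batch size, workload shape), e.g. (8, "uniform").
Bucket = Tuple[int, str]


def calibrate(timings_ms: Dict[Bucket, Dict[str, float]],
              baseline: str = "flashinfer",
              min_gain: float = 1.02) -> Dict[Bucket, str]:
    """Pick a backend per bucket from offline timings.

    The alternative backend is selected only when it beats the baseline by at
    least `min_gain`; otherwise the baseline is kept, so noisy measurements
    are less likely to cause a regression.
    """
    policy: Dict[Bucket, str] = {}
    for bucket, times in timings_ms.items():
        best = min(times, key=times.get)
        if best != baseline and times[baseline] / times[best] >= min_gain:
            policy[bucket] = best
        else:
            policy[bucket] = baseline
    return policy


def choose_backend(policy: Dict[Bucket, str], bucket: Bucket,
                   baseline: str = "flashinfer") -> str:
    """Look up the backend at serving time; uncalibrated buckets use the baseline."""
    return policy.get(bucket, baseline)


# Placeholder timings for illustration only (NOT results from the paper).
PLACEHOLDER_TIMINGS_MS = {
    (8, "uniform"): {"flashinfer": 10.0, "persistentkv": 8.0},
    (4, "bimodal"): {"flashinfer": 10.0, "persistentkv": 11.0},
}
POLICY = calibrate(PLACEHOLDER_TIMINGS_MS)

if __name__ == "__main__":
    print(choose_backend(POLICY, (8, "uniform")))
    print(choose_backend(POLICY, (4, "bimodal")))
    print(choose_backend(POLICY, (1, "unseen")))
```

Falling back to the baseline for unseen or marginal buckets is a conservative design choice: the policy can only leave the baseline where calibration showed a clear win.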

Key insights

Adaptive page-aware decode scheduling raised long-context decode throughput on a commodity GPU (RTX 3060) by 1.063-1.399x across the reported B8 and B1 workloads by optimizing KV cache movement and selecting the faster kernel per workload.

Principles

Method

PersistentKV maps work by KV-head group so that K and V tiles are reused across the query heads in each group, supports native page tables, and uses a compact workqueue schedule that executes only non-empty (row, KV head, sequence split) tasks.
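The compact workqueue idea can be illustrated on the host side with a short sketch. This is not the PersistentKV implementation: the `Task` layout, the `pages_per_split` parameter, and the dense-grid comparison are assumptions made for illustration. The sketch enumerates one task per (row, KV head, sequence split) that actually covers cached pages, and compares the count with a dense grid sized for the longest sequence in the batch.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    row: int         # request index within the batch
    kv_head: int     # KV head; the Hq/Hkv query heads in its group share the task
    page_start: int  # first logical page covered by this sequence split
    page_end: int    # one past the last logical page


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def build_workqueue(seq_lens: List[int], num_kv_heads: int = 8,
                    page_size: int = 16, pages_per_split: int = 64) -> List[Task]:
    """Emit only tasks that cover at least one page; empty rows emit nothing."""
    tasks: List[Task] = []
    for row, n in enumerate(seq_lens):
        num_pages = ceil_div(n, page_size)
        for start in range(0, num_pages, pages_per_split):
            end = min(start + pages_per_split, num_pages)
            for head in range(num_kv_heads):
                tasks.append(Task(row, head, start, end))
    return tasks


def dense_grid_size(seq_lens: List[int], num_kv_heads: int = 8,
                    page_size: int = 16, pages_per_split: int = 64) -> int:
    """Task count if every row were padded to the longest sequence's splits."""
    if not seq_lens:
        return 0
    max_pages = max(ceil_div(n, page_size) for n in seq_lens)
    return len(seq_lens) * num_kv_heads * ceil_div(max_pages, pages_per_split)


if __name__ == "__main__":
    lens = [100, 5000, 0, 2048]  # a skewed batch, chosen for illustration
    queue = build_workqueue(lens)
    print(len(queue), dense_grid_size(lens))
```

For this skewed example batch the compact queue holds 64 tasks against 160 slots in the padded grid, which is the kind of gap that skewed (bimodal or Zipf-like) length distributions tend to produce.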

Best for: MLOps Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.