PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs
Summary
PersistentKV is a decode attention engine with native block-table support that targets the key-value (KV) cache movement bottleneck in long-context large language model (LLM) serving on commodity GPUs. Designed for grouped-query attention (GQA), it reuses K and V tiles across the query heads in each group, reads page tables natively, and runs a compact workqueue schedule that executes only non-empty tasks. In benchmarks on an RTX 3060 (FP16, page size 16, Hq=32, Hkv=8, d=128), a calibrated adaptive policy that selects between FlashInfer and PersistentKV improved synchronized wall throughput by 1.063-1.265x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.399x on a B1 bucketed trace. On B4 bimodal workloads, the policy avoided a regression by choosing FlashInfer.
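To make the KV movement argument concrete, the following back-of-the-envelope sketch counts the K and V bytes read for one decode step of one sequence in one layer, using the benchmark configuration above (Hq=32, Hkv=8, d=128, FP16). The context length of 32768 tokens and the "without reuse" baseline are assumptions chosen for illustration; they are not taken from the original article, and a real kernel's traffic also depends on caching effects that this arithmetic ignores.

```python
def kv_bytes_per_step(seq_len, h_q=32, h_kv=8, d=128, dtype_bytes=2,
                      reuse_across_group=True):
    """Bytes of K and V read for one decode step (one sequence, one layer)."""
    # With reuse, each KV head's tiles are read once and shared by its group
    # of query heads. Without reuse (a hypothetical baseline), every query
    # head reads the tiles of its KV head separately.
    head_reads = h_kv if reuse_across_group else h_q
    return 2 * seq_len * head_reads * d * dtype_bytes  # factor 2: K and V


SEQ_LEN = 32768  # assumed context length, not from the source
group_size = 32 // 8  # query heads per KV head (Hq / Hkv)
shared = kv_bytes_per_step(SEQ_LEN)
per_head = kv_bytes_per_step(SEQ_LEN, reuse_across_group=False)

print(f"group size:    {group_size}")
print(f"with reuse:    {shared / 2**20:.0f} MiB")
print(f"without reuse: {per_head / 2**20:.0f} MiB")
```

Under these assumptions the shared read is 128 MiB per layer per step, and the hypothetical per-head baseline is four times that, which matches the group size of 4.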
Key takeaway
AI engineers deploying long-context LLMs on commodity GPUs should consider adaptive, page-aware decode scheduling to reduce the KV cache movement bottleneck. In the reported benchmarks, a calibrated policy that selects between PersistentKV and FlashInfer improved synchronized wall throughput on B1 and B8 long-context decode steps and avoided a regression on B4 bimodal workloads by choosing FlashInfer. Evaluate such a policy on your own workloads rather than committing to a single static kernel.
Key insights
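A calibrated adaptive policy that selects between FlashInfer and PersistentKV improved synchronized wall throughput by 1.063-1.265x on B8 workloads and by 1.399x on a B1 bucketed trace on an RTX 3060, by reducing KV cache movement during decode.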
Principles
- KV cache movement limits long-context decode throughput on commodity GPUs.
- A calibrated adaptive policy can outperform a single fixed kernel across mixed workloads.
- How work is assigned to the GPU is critical for serving throughput.
Method
PersistentKV maps work by KV-head group, reuses K and V tiles across the query heads in each group, supports native page tables, and uses a compact workqueue schedule that executes only non-empty (row, KV head, sequence split) tasks.
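The compact workqueue idea can be sketched on the host side as follows. This is a minimal illustration, not the engine's actual scheduler: the `Task` layout, the `pages_per_split` parameter, and the dense-grid comparison are hypothetical names and choices introduced here. Only the page size of 16 and Hkv=8 come from the benchmark configuration.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    row: int         # index of the sequence in the batch
    kv_head: int     # KV head whose tiles this task reads
    split: int       # index of the sequence split within the row
    page_start: int  # first logical page covered by the split
    page_end: int    # one past the last logical page


def build_workqueue(seq_lens: List[int], h_kv: int = 8, page_size: int = 16,
                    pages_per_split: int = 64) -> List[Task]:
    """Enumerate only the (row, KV head, split) tasks that contain work."""
    tasks = []
    for row, n_tokens in enumerate(seq_lens):
        n_pages = (n_tokens + page_size - 1) // page_size  # ceiling division
        for kv_head in range(h_kv):
            # range() is empty when n_pages == 0, so empty rows add no tasks.
            for split, start in enumerate(range(0, n_pages, pages_per_split)):
                end = min(start + pages_per_split, n_pages)
                tasks.append(Task(row, kv_head, split, start, end))
    return tasks


def dense_grid_size(seq_lens: List[int], h_kv: int = 8, page_size: int = 16,
                    pages_per_split: int = 64) -> int:
    """Task count if every row were padded to the longest row's split count."""
    max_pages = max(((n + page_size - 1) // page_size for n in seq_lens),
                    default=0)
    max_splits = (max_pages + pages_per_split - 1) // pages_per_split
    return len(seq_lens) * h_kv * max_splits


seq_lens = [4096, 100, 0, 20000]  # assumed, deliberately uneven batch
queue = build_workqueue(seq_lens)
print(len(queue), "compact tasks vs", dense_grid_size(seq_lens), "dense slots")
```

For this uneven batch the compact queue holds 200 tasks, while a grid padded to the longest row would launch 640 slots, most of them empty. The gap grows as the sequence-length distribution becomes more skewed.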
In practice
- Implement adaptive decode scheduling.
- Optimize KV-head group work mapping.
- Calibrate policies for specific workloads.
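A calibrated selection policy of the kind described above can be sketched as a lookup table built from per-bucket measurements. Everything named here is an assumption for illustration: the `calibrate` and `choose` helpers, the bucket keys, and especially the cost table, whose numbers are placeholders invented to exercise the code and are not measurements from the article.

```python
from typing import Callable, Dict, Hashable, Iterable


def calibrate(buckets: Iterable[Hashable], backends: Iterable[str],
              measure: Callable[[str, Hashable], float]) -> Dict[Hashable, str]:
    """Map each workload bucket to the backend with the lowest measured cost."""
    backends = list(backends)
    policy = {}
    for bucket in buckets:
        timings = {name: measure(name, bucket) for name in backends}
        policy[bucket] = min(timings, key=timings.get)
    return policy


def choose(policy: Dict[Hashable, str], bucket: Hashable, default: str) -> str:
    """Pick the calibrated backend, falling back to a default for unseen buckets."""
    return policy.get(bucket, default)


# Placeholder costs in arbitrary units; NOT results from the article.
PLACEHOLDER_COST = {
    ("flashinfer", ("B8", "bimodal")): 1.2,
    ("persistentkv", ("B8", "bimodal")): 1.0,
    ("flashinfer", ("B4", "bimodal")): 1.0,
    ("persistentkv", ("B4", "bimodal")): 1.1,
}

policy = calibrate(
    buckets=[("B8", "bimodal"), ("B4", "bimodal")],
    backends=["flashinfer", "persistentkv"],
    measure=lambda name, bucket: PLACEHOLDER_COST[(name, bucket)],
)
print(policy)
print(choose(policy, ("B2", "uniform"), default="flashinfer"))
```

In a real deployment, `measure` would time each kernel on representative batches with the device synchronized before and after the call, since the reported metric is synchronized wall throughput. Keeping a default backend for uncalibrated buckets is one way to avoid regressions on workloads the calibration never saw.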
Topics
- Long-Context LLMs
- KV Cache Optimization
- GPU Serving
- Page-Aware Scheduling
- Grouped-Query Attention
- PersistentKV
Best for: MLOps Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.