PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs
Summary
PersistentKV is a page-aware decode scheduling engine that targets two bottlenecks in long-context LLM serving, particularly on commodity GPUs: key-value (KV) cache movement and GPU under-utilization. It maps work by KV-head group, reuses K,V tiles across the query heads in a group, supports native page tables, and uses a compact workqueue schedule that contains only non-empty tasks. On an RTX 3060 with FP16, page size 16, Hq=32 query heads, Hkv=8 KV heads, and head dimension d=128, an adaptive policy that uses FlashInfer for small active batches and PersistentKV for long-context steps improved synchronized wall throughput by 1.063x to 1.265x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.399x on a B1 bucketed trace. The results indicate that work assignment, not just attention math, is a decisive serving-system variable.
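To make the KV-movement argument concrete, the arithmetic below works out the KV footprint implied by the configuration quoted above. It is a minimal sketch that assumes a single layer, FP16 storage (2 bytes per element), and no page-table overhead; the function names are illustrative and not taken from the paper.

```python
# Configuration quoted in the summary (RTX 3060 experiments).
H_Q = 32            # query heads
H_KV = 8            # KV heads
HEAD_DIM = 128      # head dimension d
PAGE_SIZE = 16      # tokens per KV page
BYTES_PER_ELEM = 2  # FP16


def gqa_group_size(h_q: int, h_kv: int) -> int:
    """Number of query heads that share one KV head under grouped-query attention."""
    if h_q % h_kv != 0:
        raise ValueError("h_q must be a multiple of h_kv")
    return h_q // h_kv


def kv_bytes_per_token(h_kv: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of K plus V stored per token for one layer (assumption: one layer only)."""
    return 2 * h_kv * head_dim * bytes_per_elem


def kv_bytes_per_page(page_size: int, h_kv: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of K plus V held by one page for one layer."""
    return page_size * kv_bytes_per_token(h_kv, head_dim, bytes_per_elem)


group = gqa_group_size(H_Q, H_KV)
per_token = kv_bytes_per_token(H_KV, HEAD_DIM, BYTES_PER_ELEM)
per_page = kv_bytes_per_page(PAGE_SIZE, H_KV, HEAD_DIM, BYTES_PER_ELEM)

print(f"query heads per KV head: {group}")
print(f"KV bytes per token per layer: {per_token}")
print(f"KV bytes per page per layer: {per_page}")
```

With these numbers, four query heads share each KV head, so a schedule that loads a K,V tile once per KV-head group rather than once per query head reads up to four times less KV data for the same attention math.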
Key takeaway
MLOps engineers optimizing long-context LLM inference on commodity GPUs should consider adaptive page-aware decode scheduling. The approach routes small active batches to a specialized kernel such as FlashInfer and long-context steps to PersistentKV's workqueue schedule; in the reported RTX 3060 experiments it improved throughput by 1.063x to 1.399x. Measure your own workloads to calibrate the switching thresholds and split counts, and check boundary cases for performance regressions. Treat work assignment as a tuning target in its own right rather than optimizing attention math alone.
Key insights
Adaptive page-aware decode scheduling raises long-context LLM serving throughput on commodity GPUs (1.063x to 1.399x in the reported RTX 3060 experiments) by optimizing work assignment.
Principles
- KV cache movement limits long-context LLM serving.
- The best single kernel isn't always the best schedule.
- Work assignment is decisive for serving.
Method
PersistentKV uses a native block-table decode attention engine with a compact workqueue schedule. The schedule executes only non-empty tasks, where each task is a (row, KV head, sequence split) triple, and K,V tiles are reused across the query heads grouped under each KV head.
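The idea of a compact workqueue can be sketched on the host side as follows. This is an illustration under stated assumptions, not the paper's implementation: the `Task` fields, the fixed `pages_per_split` rule, and the task ordering are all hypothetical, and the real engine runs such tasks inside a GPU kernel.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    row: int         # sequence (batch row) index
    kv_head: int     # KV-head group handled by this task
    split: int       # index of the sequence split within the row
    page_start: int  # first logical page covered (inclusive)
    page_end: int    # last logical page covered (exclusive)


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def build_workqueue(seq_lens: List[int], num_kv_heads: int,
                    page_size: int, pages_per_split: int) -> List[Task]:
    """Enumerate only the non-empty (row, KV head, split) tasks.

    Assumption: each split covers a fixed number of pages; rows with no
    cached tokens contribute no tasks at all.
    """
    tasks: List[Task] = []
    for row, seq_len in enumerate(seq_lens):
        num_pages = ceil_div(seq_len, page_size)
        for split, page_start in enumerate(range(0, num_pages, pages_per_split)):
            page_end = min(page_start + pages_per_split, num_pages)
            for kv_head in range(num_kv_heads):
                tasks.append(Task(row, kv_head, split, page_start, page_end))
    return tasks


def dense_grid_size(seq_lens: List[int], num_kv_heads: int,
                    page_size: int, pages_per_split: int) -> int:
    """Task count if every row were padded to the longest row's split count."""
    if not seq_lens:
        return 0
    max_pages = max(ceil_div(n, page_size) for n in seq_lens)
    max_splits = ceil_div(max_pages, pages_per_split)
    return len(seq_lens) * num_kv_heads * max_splits


# Illustrative batch: one empty row, one short row, one long row.
lens = [0, 40, 1000]
queue = build_workqueue(lens, num_kv_heads=8, page_size=16, pages_per_split=8)
print(len(queue), "compact tasks vs", dense_grid_size(lens, 8, 16, 8), "dense slots")
```

For a skewed batch like this one, the compact queue holds far fewer entries than a dense grid sized for the longest row, which is the intuition behind scheduling only non-empty tasks.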
In practice
- Use adaptive policies for mixed workloads.
- Use FlashInfer for small active batches.
- Apply PersistentKV for long-context steps.
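One way to turn these guidelines into a policy is to calibrate a per-bucket lookup from your own measurements. The sketch below is an assumption-laden illustration: the bucket key, the backend labels `"flashinfer"` and `"persistentkv"` (plain strings, not library calls), and the `min_speedup` safety margin are hypothetical choices, and the source does not describe how its adaptive policy is actually implemented.

```python
from typing import Dict, Tuple

# Hypothetical bucket key: (active batch size, context-length bucket upper bound).
Bucket = Tuple[int, int]


def calibrate_policy(measured_ms: Dict[Bucket, Dict[str, float]],
                     default: str = "flashinfer",
                     candidate: str = "persistentkv",
                     min_speedup: float = 1.05) -> Dict[Bucket, str]:
    """Pick a backend per bucket from measured decode-step latencies.

    The candidate is chosen only when it beats the default by at least
    `min_speedup`, which guards against regressions on boundary cases
    where the two backends are nearly tied.
    """
    policy: Dict[Bucket, str] = {}
    for bucket, timings in measured_ms.items():
        speedup = timings[default] / timings[candidate]
        policy[bucket] = candidate if speedup >= min_speedup else default
    return policy


def choose_backend(policy: Dict[Bucket, str], bucket: Bucket,
                   default: str = "flashinfer") -> str:
    """Look up the backend for a bucket; unseen buckets fall back to the default."""
    return policy.get(bucket, default)


# Placeholder latencies for illustration only; these are NOT measurements.
example_ms = {
    (1, 4096): {"flashinfer": 1.0, "persistentkv": 1.2},
    (8, 8192): {"flashinfer": 2.0, "persistentkv": 1.98},
    (8, 32768): {"flashinfer": 4.0, "persistentkv": 3.0},
}
policy = calibrate_policy(example_ms)
for bucket in sorted(example_ms):
    print(bucket, "->", choose_backend(policy, bucket))
```

Keeping the default backend unless the candidate wins by a clear margin is a conservative design choice: it trades a little peak throughput for protection against regressions near the switching boundary.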
Topics
- LLM Serving
- KV Cache Optimization
- Page-Aware Scheduling
- Commodity GPUs
- FlashInfer
- Grouped-Query Attention
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.