PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

2026-06-25 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

PersistentKV, a novel native block-table decode attention engine, addresses the key-value (KV) cache movement bottleneck in long-context large language model (LLM) serving on commodity GPUs. Designed for grouped-query attention (GQA), it reuses K,V tiles across grouped query heads, supports native page tables, and implements a compact workqueue schedule that executes only non-empty tasks. Benchmarks on an RTX 3060 (FP16, page size 16, Hq=32, Hkv=8, d=128) show that a calibrated adaptive policy selecting between FlashInfer and PersistentKV improved synchronized wall throughput by 1.063x to 1.265x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.399x on a B1 bucketed trace. On B4 bimodal workloads the policy avoided a regression by choosing FlashInfer.
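To make the scale of the KV movement concrete, the short sketch below works through the arithmetic for the benchmark geometry quoted above (FP16, page size 16, Hq=32, Hkv=8, d=128). The function names are illustrative, and every figure is per layer, because the summary does not state a layer count or model.

```python
# Benchmark geometry as stated in the summary. Figures are per layer;
# the layer count is not given in the source, so it is left out.
HQ, HKV, HEAD_DIM, PAGE_SIZE = 32, 8, 128, 16
BYTES_FP16 = 2


def gqa_group_size(hq: int, hkv: int) -> int:
    """Number of query heads that share one KV head under GQA."""
    if hq % hkv != 0:
        raise ValueError("Hq must be a multiple of Hkv")
    return hq // hkv


def kv_bytes_per_token(hkv: int, head_dim: int, elem_bytes: int) -> int:
    """K and V each hold hkv * head_dim elements per token (one layer)."""
    return 2 * hkv * head_dim * elem_bytes


def kv_bytes_per_decode_step(context_len: int) -> int:
    """Bytes of K,V read to decode one token for one sequence (one layer)."""
    return context_len * kv_bytes_per_token(HKV, HEAD_DIM, BYTES_FP16)


group = gqa_group_size(HQ, HKV)                              # 4 query heads per KV head
per_token = kv_bytes_per_token(HKV, HEAD_DIM, BYTES_FP16)    # 4096 bytes
per_page = per_token * PAGE_SIZE                             # 65536 bytes (64 KiB)
step_32k = kv_bytes_per_decode_step(32 * 1024)               # 134217728 bytes (128 MiB)
```

Under these assumptions, a single decode step over a 32K-token context reads 128 MiB of K,V per layer for each sequence. With a group size of 4, loading each tile once and sharing it across the grouped query heads avoids up to a 4x repeat of that traffic in the worst case where every query head fetches its tiles independently.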

Key takeaway

For AI engineers deploying long-context LLMs on commodity GPUs, adaptive page-aware decode scheduling is worth prioritizing as a way to reduce the KV cache movement bottleneck. In the reported RTX 3060 benchmarks, a calibrated policy that selects between FlashInfer and PersistentKV improved synchronized wall throughput by 1.063x to 1.265x on B8 workloads and by 1.399x on a B1 bucketed trace, and it avoided a regression on B4 bimodal workloads by choosing FlashInfer. Evaluate such a dynamic policy against your own workloads rather than committing to a single static kernel, which can regress on some workload shapes.

Key insights

Adaptive page-aware decode scheduling reduces KV cache movement and, in the reported RTX 3060 benchmarks, raised synchronized wall throughput for long-context LLM serving by up to 1.399x.

Principles

KV cache movement limits LLM serving.
Adaptive scheduling outperforms single-kernel implementations.
Work assignment is critical for serving throughput.

Method

PersistentKV maps work by KV-head group, reuses K,V tiles across the query heads in each group, supports native page tables, and employs a compact workqueue schedule that executes only non-empty tasks, where each task is one combination of row, KV head, and sequence split.
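The compact workqueue idea can be sketched on the host side as follows. This is a minimal illustration, not PersistentKV's implementation: the `Task` fields, the even page-based split rule, and the function name `build_workqueue` are all assumptions made for the example. It enumerates (row, KV head, sequence split) tasks over a paged KV cache and drops any task whose page range is empty, so short or empty rows do not occupy slots that a dense grid would launch anyway.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    row: int         # sequence (batch row) index
    kv_head: int     # KV head index; serves a whole group of query heads
    split: int       # sequence-split index within the row
    page_start: int  # first logical page (index into the row's block table)
    page_end: int    # one past the last logical page


def build_workqueue(seq_lens: List[int], num_kv_heads: int,
                    num_splits: int, page_size: int = 16) -> List[Task]:
    """Enumerate only the non-empty (row, kv_head, split) tasks.

    Hypothetical split rule: pages are divided as evenly as possible
    across num_splits, and splits that receive no pages are skipped.
    """
    queue: List[Task] = []
    for row, seq_len in enumerate(seq_lens):
        num_pages = -(-seq_len // page_size)          # ceiling division
        pages_per_split = -(-num_pages // num_splits)  # 0 when the row is empty
        for kv_head in range(num_kv_heads):
            for split in range(num_splits):
                start = split * pages_per_split
                end = min(start + pages_per_split, num_pages)
                if start >= end:
                    continue  # empty task: never enters the queue
                queue.append(Task(row, kv_head, split, start, end))
    return queue


# Three rows of very different lengths, Hkv=8, four splits per row.
queue = build_workqueue([0, 16, 100], num_kv_heads=8, num_splits=4)
dense_grid = 3 * 8 * 4  # tasks a dense (row, head, split) launch would cover
```

In this example the dense grid has 96 slots but the queue holds 40 tasks: the empty row contributes none, the one-page row contributes one task per KV head, and the seven-page row contributes four per KV head. A persistent kernel would then pull tasks from such a queue; that device-side part is outside the scope of this sketch.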

In practice

Implement adaptive decode scheduling.
Optimize KV-head group work mapping.
Calibrate policies for specific workloads.
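One way to read the three practices above is as a small calibration table that picks a backend per workload bucket. The sketch below is an assumption-laden illustration rather than the policy from the original article: the bucketing rule, the function names, and the timing values are hypothetical placeholders, and the backend names are plain labels, so no FlashInfer or PersistentKV API is called.

```python
from statistics import median
from typing import Dict, List, Tuple

Bucket = Tuple[int, int]  # (batch size, context-length bucket)


def bucket_key(batch_size: int, max_context: int) -> Bucket:
    """Hypothetical bucketing: round context length up to a power of two."""
    length_bucket = 1 << max(max_context - 1, 0).bit_length()
    return batch_size, length_bucket


def calibrate(samples: Dict[Bucket, Dict[str, List[float]]]) -> Dict[Bucket, str]:
    """For each bucket, keep the backend with the lowest median step time."""
    return {
        key: min(timings, key=lambda name: median(timings[name]))
        for key, timings in samples.items()
    }


def choose_backend(policy: Dict[Bucket, str], batch_size: int,
                   max_context: int, default: str = "flashinfer") -> str:
    """Look up the calibrated choice; fall back to the default if unseen."""
    return policy.get(bucket_key(batch_size, max_context), default)


# Placeholder timings in milliseconds, invented for illustration only.
# In practice these would come from synchronized wall-clock measurements
# of each backend on the target GPU and workload.
samples = {
    (8, 8192): {"flashinfer": [1.20, 1.30, 1.25], "persistentkv": [1.00, 1.10, 1.05]},
    (4, 8192): {"flashinfer": [0.90, 0.95, 0.92], "persistentkv": [1.00, 1.00, 1.10]},
}
policy = calibrate(samples)
```

With this table, `choose_backend(policy, 8, 5000)` returns `"persistentkv"`, `choose_backend(policy, 4, 8192)` returns `"flashinfer"`, and an uncalibrated bucket falls back to the default. Falling back to an established kernel for unseen buckets is a conservative design choice: the policy only switches where measurements support it.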

Topics

Long-Context LLMs
KV Cache Optimization
GPU Serving
Page-Aware Scheduling
Grouped-Query Attention
PersistentKV

Best for: MLOps Engineer, Research Scientist, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.