PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

2026-06-25 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

PersistentKV is a page-aware decode scheduling engine that targets two bottlenecks in long-context LLM serving on commodity GPUs: key-value (KV) cache movement and GPU under-utilization. It maps work by KV-head group, reuses K and V tiles across the query heads in each group, reads native page tables, and schedules only non-empty tasks through a compact workqueue. On an RTX 3060 with FP16, page size 16, Hq=32, Hkv=8, and d=128, an adaptive policy that uses FlashInfer for small active batches and PersistentKV for long-context steps improved synchronized wall throughput by 1.063x to 1.265x on B8 bimodal, uniform, and Zipf-like workloads, and by 1.399x on a B1 bucketed trace. The results indicate that work assignment, not just attention math, is a decisive serving-system variable.
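To make the quoted configuration concrete, the following back-of-the-envelope sketch works out the grouped-query group size and the KV footprint per token and per page. The function names are hypothetical, the figures are per layer, and the "no reuse" case is a deliberately naive model in which every query head reloads its KV head's tiles; real kernels and hardware caches may already avoid part of that traffic.

```python
# Configuration quoted in the summary (RTX 3060 experiment).
H_Q = 32          # query heads (Hq)
H_KV = 8          # KV heads (Hkv)
HEAD_DIM = 128    # head dimension (d)
PAGE_SIZE = 16    # tokens per KV page
BYTES_FP16 = 2    # bytes per FP16 element


def gqa_group_size(h_q=H_Q, h_kv=H_KV):
    """Number of query heads that share one KV head."""
    return h_q // h_kv


def kv_bytes_per_token(h_kv=H_KV, head_dim=HEAD_DIM, elem_bytes=BYTES_FP16):
    """Bytes of K plus V stored per token, for one layer."""
    return 2 * h_kv * head_dim * elem_bytes


def kv_bytes_per_page(page_size=PAGE_SIZE):
    """Bytes of K plus V held by one page, for one layer."""
    return page_size * kv_bytes_per_token()


def kv_bytes_read_per_step(context_len, reuse_across_group):
    """Naive estimate of KV bytes read for one decode step of one sequence.

    Assumption: without reuse, each query head in a group loads the
    shared K,V tiles separately; with reuse they are loaded once.
    """
    loads = 1 if reuse_across_group else gqa_group_size()
    return loads * context_len * kv_bytes_per_token()


if __name__ == "__main__":
    print("group size:", gqa_group_size())
    print("bytes per token:", kv_bytes_per_token())
    print("bytes per page:", kv_bytes_per_page())
    print("32k context, reuse:", kv_bytes_read_per_step(32768, True))
    print("32k context, no reuse:", kv_bytes_read_per_step(32768, False))
```

Under this model, four query heads share each KV head, so tile reuse cuts the modeled KV traffic for a decode step by up to a factor of four.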

Key takeaway

If you are an MLOps engineer serving long-context LLMs on commodity GPUs, consider an adaptive, page-aware decode scheduling policy. Routing small active batches to a specialized kernel such as FlashInfer and long-context steps to PersistentKV's workqueue improved synchronized wall throughput by 1.063x to 1.399x in the reported RTX 3060 runs. Calibrate thresholds and split counts on your own workloads so that boundary cases do not regress. Treat work assignment as a first-class tuning target rather than optimizing attention math alone.

Key insights

Adaptive page-aware decode scheduling raises long-context LLM serving throughput on commodity GPUs by changing how work is assigned to the GPU, not by changing the attention math.

Principles

KV cache movement limits long-context LLM serving.
The best single kernel is not always the best schedule.
Work assignment is a decisive serving-system variable.

Method

PersistentKV is a decode attention engine that reads native block tables (page tables) directly. It builds a compact workqueue and executes only non-empty tasks, where each task is one (row, KV head, sequence split) combination, and it reuses K and V tiles across the query heads grouped under each KV head.
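The compact workqueue idea can be illustrated on the host side with a minimal sketch. This is not the paper's implementation: the `Task` layout, the `pages_per_split` parameter, and the dense-grid comparison are assumptions chosen to show why enumerating only non-empty (row, KV head, sequence split) tasks helps when sequence lengths in a batch are very uneven.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    row: int         # request index within the batch
    kv_head: int     # KV head; its grouped query heads share the K,V tiles
    page_start: int  # first logical page of the split (inclusive)
    page_end: int    # last logical page of the split (exclusive)


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def build_workqueue(seq_lens: List[int], num_kv_heads: int,
                    page_size: int, pages_per_split: int) -> List[Task]:
    """Enumerate only tasks that cover at least one KV page."""
    queue = []
    for row, seq_len in enumerate(seq_lens):
        num_pages = ceil_div(seq_len, page_size)
        # Rows with no pages contribute no tasks at all.
        for start in range(0, num_pages, pages_per_split):
            end = min(start + pages_per_split, num_pages)
            for kv_head in range(num_kv_heads):
                queue.append(Task(row, kv_head, start, end))
    return queue


def dense_grid_size(seq_lens: List[int], num_kv_heads: int,
                    page_size: int, pages_per_split: int) -> int:
    """Task slots if every row were padded to the longest row's split count."""
    if not seq_lens:
        return 0
    max_pages = max(ceil_div(s, page_size) for s in seq_lens)
    max_splits = ceil_div(max_pages, pages_per_split)
    return len(seq_lens) * num_kv_heads * max_splits


if __name__ == "__main__":
    lens = [16, 4096]  # one short and one long request
    q = build_workqueue(lens, num_kv_heads=8, page_size=16, pages_per_split=64)
    print(len(q), "tasks vs", dense_grid_size(lens, 8, 16, 64), "dense slots")
```

In a real engine the queue would be consumed by persistent GPU thread blocks and the logical page range would be translated through the block table; both steps are omitted here.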

In practice

Use adaptive policies for mixed workloads.
Route small active batches to FlashInfer.
Route long-context steps to PersistentKV.
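A per-step dispatcher along these lines might look like the sketch below. The source does not give the actual decision rule or its thresholds, so `PolicyConfig`, its default values, and the backend labels are all hypothetical placeholders to be calibrated on your own workload. The function only returns a label and does not call any FlashInfer or PersistentKV API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PolicyConfig:
    # Placeholder thresholds (assumptions, not values from the paper).
    min_active_rows: int = 1         # fewer active rows than this: FlashInfer
    long_context_tokens: int = 8192  # longest context at or above this: PersistentKV


def choose_backend(seq_lens: List[int],
                   cfg: PolicyConfig = PolicyConfig()) -> str:
    """Pick a decode backend label for one scheduling step."""
    active = [s for s in seq_lens if s > 0]
    if not active:
        return "idle"
    is_long = max(active) >= cfg.long_context_tokens
    if len(active) >= cfg.min_active_rows and is_long:
        return "persistentkv"
    return "flashinfer"


if __name__ == "__main__":
    print(choose_backend([512] * 8))            # short contexts only
    print(choose_backend([512] * 7 + [32768]))  # one long-context row
```

Because regressions tend to appear near the thresholds, it is worth timing both backends on steps just above and just below each cutoff before fixing the values.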

Topics

LLM Serving
KV Cache Optimization
Page-Aware Scheduling
Commodity GPUs
FlashInfer
Grouped-Query Attention

Code references

rajveerb/stream2llm

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.