MiniPIC: Flexible Position-Independent Caching in <100LOC
Summary
MiniPIC is a novel, minimalistic design for Position-Independent Caching (PIC) in vLLM, implemented with fewer than 100 lines of core engine changes. It addresses the limitations of traditional prefix caching and existing PIC solutions by introducing a positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors, applies RoPE within attention using per-request logical positions, and exposes three token-level primitives: block-aligned padding, SSep, and PDep. This approach enables flexible reuse of recurring "spans" in retrieval-augmented and agentic workloads. Benchmarks on 2WikiMultihopQA show MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, and incurs only 5.7% worst-case overhead. The core implementation requires only 78 LOC, with 61 LOC being new functionality.
Key takeaway
For AI engineers and ML architects optimizing LLM serving for RAG or agentic workloads, MiniPIC offers a low-overhead approach to position-independent caching. Consider integrating its position-free KV cache and user-controlled primitives (padding, SSep, PDep) into your vLLM deployments. In the reported 2WikiMultihopQA benchmarks, this approach raised prefill throughput by 49% over baseline vLLM and cut time-to-first-token for cached spans by up to two orders of magnitude, without the complex engine modifications and memory overheads associated with existing PIC methods.
Key insights
Position-independent caching in LLM inference is simplified by separating positional encodings from the KV cache and using user-controlled primitives.
Principles
- Positional encodings in shared KV caches create concurrency conflicts.
- Flexible PIC requires position-free KV and user-controlled reuse.
- Align PIC implementations with existing engine architecture.
Method
MiniPIC stores unrotated K vectors, applies RoPE inside attention using per-request logical positions, and uses special tokens (padding, SSep, PDep) to modify block-hashing behavior for span reuse.
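To make the position-free idea concrete, here is a minimal sketch in plain Python, not MiniPIC's actual vLLM code. It assumes a standard pairwise RoPE rotation; the names `rope` and `attention_scores` and the toy vectors are hypothetical. The cache holds unrotated keys, and the rotation is applied only when attention reads them, using positions supplied per request.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency for the pair starting at dim i
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_scores(q, q_pos, cached_keys, key_positions):
    """Score one query against unrotated cached keys, rotating at read time."""
    q_rot = rope(q, q_pos)
    return [dot(q_rot, rope(k, p)) for k, p in zip(cached_keys, key_positions)]


# One cached span of unrotated keys, shared by two requests (toy values).
span_keys = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 1.0]]
query = [1.0, 0.5, -0.5, 2.0]

# Request A sees the span at logical positions 0-1, request B at 100-101.
# The cache entries themselves are never modified.
scores_a = attention_scores(query, 2, span_keys, [0, 1])
scores_b = attention_scores(query, 102, span_keys, [100, 101])
```

Because RoPE scores depend only on the offset between query and key positions, both requests obtain the same scores from the same cached keys, up to floating-point error. A cache that stored already-rotated keys would tie each entry to a single absolute position.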
In practice
- Pad spans to KV-block boundaries for canonical layout.
- Use SSep to make spans independently cacheable.
- Employ PDep to ensure suffix blocks depend on full context.
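The three practices above can be sketched with a toy chained block hash. This is an illustration under stated assumptions, not MiniPIC's implementation: the token ids `PAD`, `SSEP`, and `PDEP`, the block size, and the exact hashing rules are all hypothetical. The sketch assumes that SSep restarts the hash chain at the start of a span and that PDep makes a block's hash depend on every block hash before it.

```python
import hashlib

BLOCK = 4                      # toy KV-block size (assumption)
PAD, SSEP, PDEP = -1, -2, -3   # hypothetical special token ids


def pad_span(tokens):
    """Pad a span with PAD so its length is a multiple of BLOCK."""
    return tokens + [PAD] * (-len(tokens) % BLOCK)


def _h(*parts):
    return hashlib.sha256(repr(parts).encode()).hexdigest()[:12]


def block_hashes(tokens):
    """Chained block hashes with assumed SSep/PDep semantics.

    Markers are only recognised as the first token of a block, which
    holds here because every span is padded and starts with its marker.
    """
    assert len(tokens) % BLOCK == 0
    hashes, parent = [], None
    for start in range(0, len(tokens), BLOCK):
        block = tuple(tokens[start:start + BLOCK])
        if block[0] == SSEP:
            parent = None                         # span hashes on its own
        elif block[0] == PDEP:
            parent = _h("pdep", tuple(hashes))    # depend on all earlier blocks
        parent = _h(parent, block)
        hashes.append(parent)
    return hashes


doc_a = pad_span([SSEP, 11, 12, 13, 14])   # two blocks after padding
doc_b = pad_span([SSEP, 21, 22])           # one block after padding
question = pad_span([PDEP, 31, 32])        # one block after padding

h1 = block_hashes(doc_a + doc_b + question)
h2 = block_hashes(doc_b + doc_a + question)
```

Under these assumptions, the blocks of `doc_a` and `doc_b` hash identically in both prompts even though their order changes, so their cached KV blocks could be reused. The question block hashes differently in each prompt because it depends on the full preceding context.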
Topics
- Position-Independent Caching
- LLM Inference Optimization
- vLLM
- Retrieval-Augmented Generation
- KV Cache
- Rotary Position Embeddings
Best for: MLOps Engineer, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.