MiniPIC: Flexible Position-Independent Caching in <100LOC

2026-05-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

MiniPIC is a minimal design for Position-Independent Caching (PIC) in vLLM, implemented with fewer than 100 lines of core engine changes. It addresses the limitations of traditional prefix caching and of existing PIC solutions by introducing a KV cache free of positional encodings and a set of user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors, applies RoPE inside attention using per-request logical positions, and exposes three token-level primitives: block-aligned padding, SSep, and PDep. Together these enable flexible reuse of recurring "spans" in retrieval-augmented and agentic workloads. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces time-to-first-token for cached spans by up to two orders of magnitude, and incurs a worst-case overhead of only 5.7%. The core implementation is 78 LOC, of which 61 LOC add new functionality.

Key takeaway

For AI engineers and ML architects optimizing LLM serving for RAG or agentic workloads, MiniPIC offers a low-overhead route to position-independent caching. Consider integrating its position-free KV cache and user-controlled primitives (padding, SSep, PDep) into your vLLM deployments. In the reported 2WikiMultihopQA benchmark, MiniPIC with interleaved scheduling raised prefill throughput by 49% over baseline vLLM and cut time-to-first-token for cached spans by up to two orders of magnitude, without the complex engine modifications and memory overheads associated with existing PIC methods.

Key insights

Position-independent caching in LLM inference is simplified by separating positional encodings from the KV cache and using user-controlled primitives.

Principles

Positional encodings baked into a shared KV cache create concurrency conflicts when requests place the same span at different positions.
Flexible PIC requires position-free KV and user-controlled reuse.
Align PIC implementations with existing engine architecture.

Method

MiniPIC stores unrotated K vectors, applies RoPE inside attention using per-request logical positions, and uses special tokens (padding, SSep, PDep) to modify block-hashing behavior for span reuse.
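
To make the position-free idea concrete, here is a minimal pure-Python sketch, not MiniPIC's actual code. It assumes a single head, a toy 4-dimensional key, and hypothetical helper names (`rope`, `attention_scores`). Keys are cached unrotated, and RoPE is applied only when scores are computed, using whatever logical positions the current request assigns. Softmax, value vectors, and batching are omitted.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply RoPE: rotate each consecutive pair of `vec` by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # pair j = i/2 uses frequency base^(-2j/dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_scores(query, query_pos, cached_keys, key_positions):
    """Raw dot-product scores; cached keys are rotated here, not when they were stored."""
    q_rot = rope(query, query_pos)
    return [
        sum(a * b for a, b in zip(q_rot, rope(key, pos)))
        for key, pos in zip(cached_keys, key_positions)
    ]

# One cached span of two unrotated keys, shared by both requests below.
cached_keys = [[0.3, -1.2, 0.8, 0.5], [1.1, 0.4, -0.7, 0.9]]
query = [0.6, 0.2, -0.4, 1.0]

# Request A places the span at positions 0..1; request B places it at 10..11.
scores_a = attention_scores(query, 2, cached_keys, [0, 1])
scores_b = attention_scores(query, 12, cached_keys, [10, 11])
```

Because RoPE scores depend only on the offset between query and key positions, both requests obtain the same scores from the same cached entries. Had the keys been stored already rotated for positions 0..1, request B could not have reused them as-is.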

In practice

Pad spans to KV-block boundaries for canonical layout.
Use SSep to make spans independently cacheable.
Employ PDep to ensure suffix blocks depend on full context.
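
The three practices above can be illustrated with a toy block hasher. This is a sketch under stated assumptions rather than vLLM's or MiniPIC's implementation: the token ids for `PAD`, `SSEP`, and `PDEP`, the block size of 4, and the rule that a marker is recognized only as the first token of a block are all hypothetical choices made for illustration. The only behavior taken from the summary is the intent: padding aligns spans to block boundaries, SSep makes a span's blocks hash independently of what precedes them, and PDep makes the following blocks depend on the full context.

```python
import hashlib

BLOCK_SIZE = 4
PAD, SSEP, PDEP = -1, -2, -3  # hypothetical token ids, chosen for this sketch only

def pad_to_block(tokens, block_size=BLOCK_SIZE):
    """Right-pad a span so that it ends exactly on a KV-block boundary."""
    return tokens + [PAD] * (-len(tokens) % block_size)

def _digest(parent, tokens):
    return hashlib.sha256(repr((parent, tuple(tokens))).encode()).hexdigest()

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chained hash per full block, with SSEP/PDEP altering the parent (assumed semantics)."""
    hashes, parent = [], None
    full_len = len(tokens) - len(tokens) % block_size
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        if block[0] == SSEP:
            parent = None  # span start: forget everything that came before
        elif block[0] == PDEP:
            parent = _digest(None, tokens[:start])  # depend on the entire preceding context
        parent = _digest(parent, block)
        hashes.append(parent)
    return hashes

doc_a = pad_to_block([SSEP, 11, 12, 13, 14])  # padded to 2 blocks
doc_b = pad_to_block([SSEP, 21, 22])          # padded to 1 block
question = pad_to_block([PDEP, 31, 32])       # padded to 1 block

# The same two documents retrieved in different orders.
hashes_1 = block_hashes(doc_a + doc_b + question)
hashes_2 = block_hashes(doc_b + doc_a + question)
```

In this sketch the blocks of `doc_a` and `doc_b` hash identically in both prompts, so their cached KV entries could be found regardless of order, while the question block hashes differently because its context differs.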

Topics

Position-Independent Caching
LLM Inference Optimization
vLLM
Retrieval-Augmented Generation
KV Cache
Rotary Position Embeddings

Best for: MLOps Engineer, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.