Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving
Summary
Tangram is a serving system that makes non-uniform Key-Value (KV) cache compression practical for multi-turn Large Language Model (LLM) serving. In multi-turn serving, the KV cache grows linearly with context length, which puts severe pressure on GPU memory and bandwidth. Non-uniform compression preserves accuracy by retaining the critical entries for each attention head, but in existing systems such as vLLM it causes memory fragmentation, scheduling complexity, and reduced kernel utilization. Tangram addresses these problems with three core techniques: Deterministic Budget Allocation, Head Group Page, and Ahead-of-Time (AOT) Load Balancing. Experimental results show that Tangram improves throughput by up to 2.6x over baselines while fully preserving model accuracy on Qwen3-4B, Qwen2.5-7B-Instruct-1M, and Qwen2.5-32B across various long-context benchmarks.
Key takeaway
For AI Architects and ML Engineers deploying multi-turn LLMs, Tangram offers a way to scale inference efficiently. If your current serving system hits KV cache memory pressure and throughput bottlenecks when using non-uniform compression, consider Tangram's deterministic approach. Its techniques, such as Head Group Page and AOT Load Balancing, raise throughput by up to 2.6x while maintaining accuracy, which makes long-context LLM serving more practical.
Key insights
Head-wise KV cache retention patterns are stable and model-intrinsic, enabling deterministic optimization for non-uniform compression.
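Because the retention pattern is stable, a per-head budget can in principle be fixed once and reused for every request. The summary does not say how Tangram derives its budgets, so the following is only a minimal sketch under stated assumptions: `allocate_budgets`, `retained_counts` (integer per-head token counts from a hypothetical profiling run), and `min_budget` are illustrative names, and the proportional largest-remainder split is a stand-in, not the paper's method.

```python
def allocate_budgets(retained_counts, total_budget, min_budget=1):
    """Split a fixed KV token budget across heads, proportionally to profiled counts.

    Assumption: retained_counts[i] is an integer score for head i measured offline.
    The same inputs always yield the same budgets, so no runtime decision is needed.
    """
    n = len(retained_counts)
    spare = total_budget - min_budget * n
    if spare < 0:
        raise ValueError("total_budget is too small for the per-head minimum")
    total = sum(retained_counts)
    if total <= 0:
        raise ValueError("retained_counts must contain a positive score")

    # Exact integer shares: quotient is the base allocation, remainder breaks ties.
    shares = [divmod(spare * count, total) for count in retained_counts]
    budgets = [min_budget + quotient for quotient, _ in shares]

    # Hand out the tokens lost to rounding, largest remainder first.
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(range(n), key=lambda i: (-shares[i][1], i))
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Four heads with very different retention needs share 1000 cached tokens.
print(allocate_budgets([50, 30, 15, 5], total_budget=1000, min_budget=16))
# -> [484, 297, 156, 63]
```

Using integer arithmetic keeps the result exactly reproducible across runs and machines, which is the property a deterministic allocation relies on.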
Principles
- Per-head KV retention is stable and model-intrinsic
- Dynamic memory management introduces prohibitive overhead
- Uniform workload assumptions limit GPU efficiency
Method
Tangram profiles head-wise budgets offline, clusters heads into independent page tables, and pre-computes optimal GPU workload distributions for balanced execution.
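The grouping step can be pictured as follows. This is a hedged sketch, not Tangram's implementation: `group_heads`, the equal-size grouping rule, the example budgets, and the 16-token page size are all assumptions chosen for illustration. Heads are sorted by budget and cut into groups, and each group gets its own page count sized for its largest member.

```python
import math


def group_heads(budgets, num_groups, page_size=16):
    """Cluster heads with similar budgets; each group stands for one page table.

    Assumption: a group allocates pages for its largest head, so padding is the
    gap between that allocation and each member's own budget.
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    # Sorting puts heads with similar demands next to each other.
    order = sorted(range(len(budgets)), key=lambda h: budgets[h], reverse=True)
    group_size = math.ceil(len(order) / num_groups)

    groups = []
    for start in range(0, len(order), group_size):
        heads = order[start:start + group_size]
        pages = math.ceil(max(budgets[h] for h in heads) / page_size)
        padded = sum(pages * page_size - budgets[h] for h in heads)
        groups.append({"heads": heads, "pages": pages, "padded_slots": padded})
    return groups


# Hypothetical per-head budgets (in tokens) for an 8-head layer.
example_budgets = [484, 297, 156, 63, 480, 300, 150, 60]
for group in group_heads(example_budgets, num_groups=4):
    print(group)
```

With one page table for all eight heads, every head would be padded to the largest budget; grouping keeps the padding per head small because members of a group have similar demands.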
In practice
- Profile per-head KV retention once per model
- Group attention heads by similar retention demands
- Pre-calculate GPU workload partitions offline
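The last step, pre-calculating workload partitions, can be sketched with a simple offline scheduler. The summary does not describe Tangram's actual balancing algorithm; the version below uses a greedy longest-processing-time heuristic as a stand-in, and `partition_workloads`, the per-head costs, and the notion of a "worker" (for example a GPU thread block) are assumptions for illustration. The heuristic gives a balanced but not necessarily optimal split.

```python
import heapq


def partition_workloads(head_costs, num_workers):
    """Assign heads to workers offline so that per-worker load is roughly even.

    Assumption: head_costs maps head id -> cost (e.g. its KV budget in tokens).
    Greedy rule: take heads from most to least expensive and give each one to
    the currently least-loaded worker.
    """
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]

    for head, cost in sorted(head_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(head)
        heapq.heappush(heap, (load + cost, worker))

    loads = [sum(head_costs[h] for h in heads) for heads in assignment]
    return assignment, loads


# Hypothetical costs: one head per worker would leave loads of 484 vs 63.
assignment, loads = partition_workloads({0: 484, 1: 297, 2: 156, 3: 63}, num_workers=2)
print(assignment)  # -> [[0], [1, 2, 3]]
print(loads)       # -> [484, 516]
```

Because the budgets are known before serving starts, a table like `assignment` can be computed once per model and reused, so no balancing work is left for request time.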
Topics
- LLM Serving
- KV Cache Compression
- Non-uniform KV Cache
- Memory Management
- GPU Optimization
- Multi-turn LLMs
- vLLM
Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.