Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

Tangram is a serving system designed to make non-uniform Key-Value (KV) cache compression practical for multi-turn Large Language Model (LLM) serving. Multi-turn serving puts severe pressure on GPU memory and bandwidth because the KV cache grows linearly with context length. Non-uniform compression preserves accuracy by retaining the critical information for each attention head, but in existing systems such as vLLM it causes memory fragmentation, scheduling complexity, and reduced kernel utilization. Tangram addresses these with three core techniques: Deterministic Budget Allocation, Head Group Page, and Ahead-of-Time (AOT) Load Balancing. In experiments, Tangram improves throughput by up to 2.6x over baselines while fully preserving model accuracy on Qwen3-4B, Qwen2.5-7B-Instruct-1M, and Qwen2.5-32B across various long-context benchmarks.
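To make the memory pressure concrete, here is a rough back-of-the-envelope sketch of uncompressed KV cache size as a function of context length. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example chosen for round numbers, not a figure taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Uncompressed KV cache size for one sequence.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (8_192, 16_384, 32_768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB")
```

Doubling the accumulated context doubles the cache, which is why long multi-turn sessions motivate compression.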

Key takeaway

For AI architects and ML engineers deploying multi-turn LLMs, Tangram targets the cost of scaling inference. If your serving system hits KV cache memory pressure and throughput bottlenecks under non-uniform compression, consider Tangram's deterministic approach. Its techniques, including Head Group Page and AOT Load Balancing, raise throughput by up to 2.6x over baselines while maintaining accuracy, which makes long-context LLM serving more practical.

Key insights

Head-wise KV cache retention patterns are stable and model-intrinsic, enabling deterministic optimization for non-uniform compression.

Principles

Per-head KV retention is stable and model-intrinsic
Dynamic memory management introduces prohibitive overhead
Uniform workload assumptions limit GPU efficiency

Method

Tangram profiles head-wise budgets offline, clusters heads into independent page tables, and pre-computes optimal GPU workload distributions for balanced execution.
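The grouping step can be illustrated with a minimal sketch. This summary does not describe Tangram's actual clustering algorithm, so the sort-and-chunk heuristic below is a stand-in, and the budgets, page size, and function names are all hypothetical. The idea it shows is that heads with similar offline budgets share a page table, so padding up to the largest budget happens only within a group.

```python
import math


def group_heads_by_budget(budgets, num_groups):
    """Sort heads by budget and split them into at most num_groups contiguous chunks.

    Stand-in heuristic for illustration; not the clustering used by Tangram.
    """
    order = sorted(range(len(budgets)), key=lambda h: budgets[h])
    size = math.ceil(len(order) / num_groups)
    return [order[i:i + size] for i in range(0, len(order), size)]


def pages_per_group(budgets, groups, page_size):
    """Pages each head in a group needs when sized for the group's largest budget."""
    return [math.ceil(max(budgets[h] for h in g) / page_size) for g in groups]


# Hypothetical number of tokens retained per head, as if profiled offline.
budgets = [64, 1024, 128, 960, 96, 512, 480, 72]
groups = group_heads_by_budget(budgets, num_groups=4)
pages = pages_per_group(budgets, groups, page_size=16)

grouped_total = sum(len(g) * p for g, p in zip(groups, pages))
uniform_total = len(budgets) * math.ceil(max(budgets) / 16)
print(groups, pages, grouped_total, uniform_total)
```

With these made-up budgets, sizing every head for the global maximum needs more than twice as many pages as sizing per group.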

In practice

Profile per-head KV retention once per model
Group attention heads by similar retention demands
Pre-calculate GPU workload partitions offline
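Because the per-head budgets are fixed once profiled, the work partition can be computed ahead of time rather than at each decoding step. The sketch below uses the classic greedy longest-processing-time (LPT) heuristic as a stand-in; it is not Tangram's balancing algorithm, LPT is not guaranteed optimal, and the costs and worker count are hypothetical.

```python
import heapq


def lpt_partition(costs, num_workers):
    """Assign each item, largest first, to the currently least-loaded worker."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    assignment = [[] for _ in range(num_workers)]
    for item in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(item)
        heapq.heappush(heap, (load + costs[item], worker))
    loads = [sum(costs[i] for i in items) for items in assignment]
    return assignment, loads


def round_robin_loads(costs, num_workers):
    """Baseline that ignores cost and deals items out in index order."""
    loads = [0] * num_workers
    for i, cost in enumerate(costs):
        loads[i % num_workers] += cost
    return loads


# Hypothetical per-head KV lengths standing in for attention work per head.
costs = [64, 1024, 128, 960, 96, 512, 480, 72]
assignment, loads = lpt_partition(costs, num_workers=4)
baseline = round_robin_loads(costs, num_workers=4)
print(max(loads), max(baseline))
```

The slowest worker sets the kernel's finish time, so the quantity to compare is the maximum load: here the cost-aware partition is bounded by the single largest head, while the cost-blind baseline stacks two large heads on one worker.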

Topics

LLM Serving
KV Cache Compression
Non-uniform KV Cache
Memory Management
GPU Optimization
Multi-turn LLMs
vLLM

Code references

aiha-lab/TANGRAM

Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.