Shipping LLMs (Part 2/6): What’s Actually in Your KV Cache?
Summary
The KV cache, a core part of inference in Transformer-based large language models (LLMs), stores per-token key/value tensors during decoding so that attention over the entire prefix does not have to be recomputed for each new token. It holds one K vector and one V vector per layer, per attention head, per token. For a 7B model with 32 layers, 32 heads, and a head dimension of 128 in FP16, that is 2 × 32 × 32 × 128 × 2 bytes = 512 KB per token, which translates to 2 GB for a 4k context and 16 GB for a 32k context per request. Multiply by batch size and long context windows become very memory-intensive. The article identifies the KV cache as the second-largest memory consumer on the GPU and a primary cause of CUDA out-of-memory (OOM) errors during LLM inference.
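The per-token figure follows directly from the tensor shapes, so it is easy to check. The sketch below is a minimal illustration, not code from the original article: the function names are hypothetical, and it assumes standard multi-head attention where every attention head keeps its own K and V vectors (models using grouped-query or multi-query attention store fewer KV heads and would come out smaller). Counting both K and V, the 7B configuration above works out to 512 KiB per token.

```python
def kv_cache_bytes_per_token(num_layers: int, num_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache added per token.

    The leading factor of 2 accounts for one K vector and one V vector
    per layer, per head. bytes_per_elem=2 corresponds to FP16.
    """
    return 2 * num_layers * num_heads * head_dim * bytes_per_elem


def kv_cache_bytes(context_len: int, batch_size: int = 1, **model) -> int:
    """Total KV cache bytes for batch_size requests of context_len tokens each."""
    return kv_cache_bytes_per_token(**model) * context_len * batch_size


# Configuration from the summary: 32 layers, 32 heads, head_dim 128, FP16.
model = dict(num_layers=32, num_heads=32, head_dim=128, bytes_per_elem=2)

per_token = kv_cache_bytes_per_token(**model)
print(f"{per_token // 1024} KiB per token")            # 512 KiB per token

for ctx in (4096, 32768):
    gib = kv_cache_bytes(ctx, **model) / 2**30
    print(f"context {ctx}: {gib:.0f} GiB per request")  # 2 GiB, then 16 GiB
```

Passing `batch_size` greater than 1 multiplies the total linearly, which is why a long context window that fits for one request can fail once several requests are served together.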
Key takeaway
For AI engineers optimizing LLM inference, understanding the KV cache's memory footprint is crucial. Your GPU's "free" memory can be misleading, because the KV cache grows with every token and consumes significant memory at longer contexts and larger batch sizes. Prioritize strategies that manage or reduce KV cache size, such as caching stable prompt prefixes, to avoid CUDA OOM errors and improve overall system throughput.
Key insights
The KV cache stores per-token key/value tensors, significantly impacting LLM memory consumption and context window cost.
Principles
- KV cache size scales with context length and batch size.
- Context window cost is dominated by KV cache memory, not compute.
In practice
- Monitor KV cache usage to prevent CUDA OOM errors.
- Optimize context window length for memory efficiency.
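One way to act on these points before an OOM occurs is to budget KV cache memory up front. The sketch below is an illustrative assumption rather than a method from the original article: `max_concurrent_requests` is a hypothetical helper, and the 80 GiB GPU, 14 GiB of weights, and 8 GiB reserve for activations and runtime overhead are made-up example values. It assumes every request reserves KV cache for the full context length.

```python
GiB = 2**30


def max_concurrent_requests(gpu_bytes: int, weight_bytes: int,
                            context_len: int, kv_bytes_per_token: int,
                            reserve_bytes: int = 0) -> int:
    """How many full-length requests fit in the memory left for the KV cache.

    Memory left = total GPU memory - model weights - a reserve for
    activations and runtime overhead. Returns 0 if nothing is left.
    """
    usable = gpu_bytes - weight_bytes - reserve_bytes
    if usable <= 0:
        return 0
    return usable // (context_len * kv_bytes_per_token)


# 512 KiB per token: 2 (K and V) x 32 layers x 32 heads x 128 dims x 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2

for ctx in (4096, 32768):
    n = max_concurrent_requests(
        gpu_bytes=80 * GiB,       # assumed GPU size
        weight_bytes=14 * GiB,    # assumed weight footprint
        context_len=ctx,
        kv_bytes_per_token=KV_BYTES_PER_TOKEN,
        reserve_bytes=8 * GiB,    # assumed headroom
    )
    print(f"context {ctx}: up to {n} concurrent requests")  # 29, then 3
```

Under these assumptions, moving from a 4k to a 32k context cuts the number of requests that fit from 29 to 3, which is the memory cost of the context window in concrete terms.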
Topics
- KV Cache
- LLM Inference
- Transformer Decoding
- Memory Management
- Context Window
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.