Shipping LLMs (Part 2/6): What’s Actually in Your KV Cache?

· Source: LLM on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, quick

Summary

The KV cache, a critical component in Transformer-based Large Language Models (LLMs), stores per-token key/value tensors during decoding so that attention over the entire prefix is not recomputed for each new token. It holds one K-vector and one V-vector per layer, per attention head, per token. For a 7B model with 32 layers, 32 heads, and a head dimension of 128 in FP16, that is 2 × 32 × 32 × 128 × 2 bytes = 512 KB per token (256 KB each for keys and values). This translates to 2 GB for a 4k context and 16 GB for a 32k context per request, and the total scales linearly with batch size, which is what makes long context windows memory-intensive. The article identifies the KV cache as the second-largest memory consumer on the GPU and a primary cause of CUDA Out-Of-Memory (OOM) errors during LLM inference.
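
The per-token arithmetic can be checked with a few lines of Python. This is a minimal sketch, not code from the original article: the function name `kv_bytes_per_token` and its parameters are illustrative, and it assumes plain multi-head attention, where every attention head keeps its own K and V. For the configuration above it gives 256 KiB for K and 256 KiB for V, so 512 KiB per token when both are counted.

```python
def kv_bytes_per_token(n_layers: int, n_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that one token occupies.

    The leading factor of 2 counts the K tensor and the V tensor.
    bytes_per_elem=2 corresponds to FP16.
    """
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem


# Configuration from the summary: 32 layers, 32 heads, head dimension 128, FP16.
per_token = kv_bytes_per_token(n_layers=32, n_heads=32, head_dim=128)
print(f"{per_token // 1024} KiB per token")

# Cost of a single request at two context lengths.
for context_len in (4096, 32768):
    gib = per_token * context_len / 2**30
    print(f"{context_len} tokens: {gib:.0f} GiB per request")
```

Models that share K/V heads across query heads (grouped-query attention) would pass the smaller K/V head count as `n_heads`, which shrinks the result proportionally.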

Key takeaway

For AI Engineers optimizing LLM inference, understanding the KV cache's memory footprint is crucial. The "free" memory your GPU reports can be misleading, because the KV cache grows with every token of context and every request in the batch. Prioritize strategies that manage or reduce KV cache size, such as caching stable prompt prefixes, to avoid CUDA OOM errors and improve overall system throughput.
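
To see how quickly "free" memory disappears, the rough budget below estimates how many full-length requests fit on one GPU. It is a sketch under stated assumptions rather than a sizing tool: `max_concurrent_requests` is a hypothetical helper, the 80 GiB GPU, 14 GB of FP16 weights, and 8 GiB reserve for activations and allocator overhead are illustrative numbers, and the per-token cost of 524,288 bytes is K plus V for the 32-layer, 32-head, 128-dimension FP16 configuration in the summary.

```python
def max_concurrent_requests(gpu_bytes: int, weight_bytes: int,
                            kv_bytes_per_token: int, context_len: int,
                            reserve_bytes: int = 0) -> int:
    """Upper bound on requests whose full-context KV cache fits in memory.

    Assumes every request uses the whole context window and that memory
    not taken by weights or the reserve is available for the KV cache.
    """
    usable = gpu_bytes - weight_bytes - reserve_bytes
    per_request = kv_bytes_per_token * context_len
    return max(0, usable // per_request)


GPU_BYTES = 80 * 2**30        # assumed 80 GiB accelerator
WEIGHT_BYTES = 14 * 10**9     # assumed 7B parameters at 2 bytes each
RESERVE_BYTES = 8 * 2**30     # assumed headroom for activations and overhead
KV_PER_TOKEN = 524_288        # K + V per token for the summary's configuration

for context_len in (4096, 32768):
    n = max_concurrent_requests(GPU_BYTES, WEIGHT_BYTES, KV_PER_TOKEN,
                                context_len, RESERVE_BYTES)
    print(f"context {context_len}: at most {n} concurrent requests")
```

Under these assumptions the batch ceiling drops from 29 requests at a 4k context to 3 at 32k, which is why context length and batch size have to be budgeted together.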

Key insights

The KV cache stores per-token key/value tensors, significantly impacting LLM memory consumption and context window cost.

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.