LLaMA-2 70B Has 64 Query Heads and 8 KV Heads. Here Is the Memory Arithmetic Nobody Shows You.
Summary
The LLaMA-2 70B model uses Grouped Query Attention (GQA) to shrink the KV cache, a major memory bottleneck when serving large language models. Common explanations of GQA are often vague; this analysis works through its practical implications. LLaMA-2 70B has 64 query heads and 8 key-value (KV) heads, so each group of 8 query heads shares one key head and one value head. During autoregressive inference, the K and V vectors of every previous token are cached so they do not have to be recomputed at each step. With GQA, only 8 heads' worth of K and V are stored per layer instead of 64, an 8x reduction relative to an otherwise identical multi-head attention layout. Each query head still has its own query projection, so attention patterns remain distinct per head even though keys and values are shared within a group.
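The arithmetic behind that reduction can be sketched in a few lines. This is a minimal sketch, not the original article's calculation: the head counts (64 and 8) come from the summary, while the layer count (80), head dimension (128), 2-byte fp16 values, and 4096-token context are assumptions based on the published LLaMA-2 70B configuration. The function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token, per sequence."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Head counts are from the summary; the rest are assumed model parameters.
N_LAYERS = 80        # assumption: LLaMA-2 70B layer count
HEAD_DIM = 128       # assumption: 8192 hidden size / 64 query heads
N_QUERY_HEADS = 64
N_KV_HEADS = 8
CONTEXT_LEN = 4096   # assumption: LLaMA-2 context window

gqa_per_token = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
# Counterfactual: the same model with one KV head per query head (plain MHA).
mha_per_token = kv_cache_bytes_per_token(N_LAYERS, N_QUERY_HEADS, HEAD_DIM)

print(f"GQA: {gqa_per_token / 1024:.0f} KiB per token")   # GQA: 320 KiB per token
print(f"MHA: {mha_per_token / 2**20:.1f} MiB per token")  # MHA: 2.5 MiB per token
print(f"GQA at {CONTEXT_LEN} tokens: {gqa_per_token * CONTEXT_LEN / 2**30:.2f} GiB")  # 1.25 GiB
print(f"MHA at {CONTEXT_LEN} tokens: {mha_per_token * CONTEXT_LEN / 2**30:.2f} GiB")  # 10.00 GiB
```

Under these assumptions, a single full-length sequence needs about 1.25 GiB of cache with GQA versus 10 GiB with a 64-KV-head layout, on top of the model weights.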
Key takeaway
For MLOps engineers optimizing LLaMA-2 70B deployments, understanding the memory arithmetic of Grouped Query Attention is crucial. With 8 shared KV heads instead of 64, the KV cache is 8x smaller than it would be under standard multi-head attention, which leaves room for longer context windows or larger batch sizes on the same hardware. Account for GQA when estimating memory usage so that capacity plans reflect the actual cache size.
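To see how the smaller cache translates into batch size, here is a rough capacity sketch. The 40 GiB cache budget is a hypothetical figure chosen for illustration, not a measurement; the layer count (80), head dimension (128), fp16 storage, and 4096-token context are assumptions from the published LLaMA-2 70B configuration, and `max_sequences` is a hypothetical helper name.

```python
# Per-token KV cache size: 2 (K and V) * layers * KV heads * head_dim * bytes.
# Assumed: 80 layers, head_dim 128, fp16 (2 bytes per value).
BYTES_PER_TOKEN_GQA = 2 * 80 * 8 * 128 * 2    # 8 KV heads (LLaMA-2 70B)
BYTES_PER_TOKEN_MHA = 2 * 80 * 64 * 128 * 2   # counterfactual: 64 KV heads


def max_sequences(budget_bytes, bytes_per_token, context_len):
    """Number of full-length sequences whose KV cache fits in the budget."""
    return budget_bytes // (bytes_per_token * context_len)


budget = 40 * 2**30   # hypothetical: 40 GiB left for KV cache after weights
context = 4096        # assumption: full LLaMA-2 context window

print(max_sequences(budget, BYTES_PER_TOKEN_GQA, context))  # 32
print(max_sequences(budget, BYTES_PER_TOKEN_MHA, context))  # 4
```

The estimate ignores activation memory, allocator fragmentation, and any paging scheme the serving stack uses, so treat it as an upper bound rather than a deployment figure.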
Key insights
Grouped Query Attention (GQA) in LLaMA-2 70B cuts KV cache memory by having each group of 8 query heads share one KV head, so the cache holds 8 KV heads per layer instead of 64.
Principles
- KV cache size is a primary deployment bottleneck for LLMs.
- GQA enables distinct attention patterns with shared KV heads.
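The second principle can be illustrated with a toy decode step. This is a minimal pure-Python sketch, not the model's actual implementation: the head counts match the summary, but the head dimension (16), the number of cached tokens (5), and the random inputs are illustrative assumptions, and the function names are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_attention_weights(queries, cached_keys):
    """Attention weights for one new token.

    queries:     [n_q_heads][head_dim], one query vector per query head
    cached_keys: [n_kv_heads][n_tokens][head_dim], keys from the KV cache
    """
    group_size = len(queries) // len(cached_keys)
    weights = []
    for h, q in enumerate(queries):
        # Query head h reads the keys of KV head h // group_size.
        shared_keys = cached_keys[h // group_size]
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in shared_keys]
        weights.append(softmax(scores))
    return weights


random.seed(0)
n_q_heads, n_kv_heads, head_dim, n_tokens = 64, 8, 16, 5  # toy sizes
queries = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(n_q_heads)]
cached_keys = [[[random.gauss(0, 1) for _ in range(head_dim)]
                for _ in range(n_tokens)] for _ in range(n_kv_heads)]

weights = gqa_attention_weights(queries, cached_keys)
# Heads 0 and 1 read the same cached keys (KV head 0) yet attend differently,
# because each head has its own query vector.
print(weights[0] != weights[1])  # True
```

Only the keys and values are shared within a group; the per-head query vectors are what keep the attention distributions distinct.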
In practice
- LLaMA-2 70B uses 64 query heads and 8 KV heads.
- Each group of 8 query heads shares one set of K and V vectors, so 8 KV heads are cached per layer rather than 64.
Topics
- LLaMA-2 70B
- Grouped Query Attention
- KV Cache
- Memory Optimization
- Transformer Inference
- Large Language Models
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.