Prefill/Decode Disaggregation: Why Your GPU Can’t Do Two Things at Once
Summary
LLM inference has two distinct phases with opposite hardware demands: prefill, which processes all input tokens at once, and decode, which generates output one token at a time. Prefill is compute-bound and benefits from tensor parallelism, while decode is memory-bound and suffers from communication overhead. Because of this conflict, tuning shared GPUs for one phase compromises the other and leads to problems such as head-of-line blocking. Prefill/decode disaggregation resolves the conflict by running each phase on its own specialized GPU cluster. Transferring the KV cache between clusters has a cost, but it can be reduced by overlapping transfers with compute, using fast interconnects like NVLink, and compressing the KV cache to INT8. The approach was introduced by the 2023 Splitwise paper and was adopted rapidly in 2024 by systems such as SGLang, vLLM, and Mooncake, becoming essential as context windows expanded from 4k to 1M tokens.
Key takeaway
For AI engineers optimizing LLM inference: prefill and decode demand distinct hardware strategies. If you see unpredictable latency or sluggish responses, consider prefill/decode disaggregation. Dedicating a specialized GPU pool to each phase removes the conflict between compute-bound prefill and memory-bound decode, improving both Time To First Token (TTFT) and Time Per Output Token (TPOT). KV cache transfer adds overhead, but with the mitigations above the gains are worthwhile for large context windows.
Key insights
LLM inference's prefill and decode phases have conflicting hardware demands, necessitating disaggregation for optimal performance and user experience.
Principles
- Prefill is compute-bound; decode is memory-bound.
- Optimizing one inference phase hurts the other.
- Disaggregation separates conflicting LLM workloads.
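The first principle can be checked with a rough roofline-style estimate. The sketch below is illustrative only: it models a single square weight matrix in FP16, and the GPU figures (300 TFLOP/s peak, 2 TB/s memory bandwidth) and function names are assumptions, not values from the article.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_value=2):
    """FLOPs per byte moved for one d_model x d_model linear layer."""
    flops = 2 * tokens * d_model * d_model               # multiply + add
    weight_bytes = d_model * d_model * bytes_per_value   # weights read once
    activation_bytes = 2 * tokens * d_model * bytes_per_value  # input + output
    return flops / (weight_bytes + activation_bytes)


def bound(tokens, d_model, peak_flops, mem_bandwidth):
    """Classify a workload against the hardware's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte where the two limits meet
    if arithmetic_intensity(tokens, d_model) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical GPU: 300 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
MEM_BANDWIDTH = 2e12

prefill = bound(2048, 4096, PEAK_FLOPS, MEM_BANDWIDTH)  # 2048 prompt tokens
decode = bound(1, 4096, PEAK_FLOPS, MEM_BANDWIDTH)      # 1 token per step
```

Under these assumptions a 2048-token prefill reuses each weight across many tokens and lands above the ridge point, while a single-token decode step reads the same weights for almost no work and lands far below it.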
Method
Disaggregation uses separate GPU clusters for prefill (optimized for compute and tensor parallelism) and decode (optimized for memory and concurrency). The KV cache is transferred from the prefill cluster to the decode cluster, and the transfer cost is mitigated by overlapping it with compute and by compressing the cache.
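To get a feel for the transfer cost, a back-of-the-envelope estimate helps. The sketch below assumes a standard per-layer key/value layout and a uniform two-stage pipeline (compute a layer, then ship its KV block); the model shape and per-layer timings are hypothetical, not figures from the article.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one request."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def handoff_seconds(layers, compute_per_layer_s, transfer_per_layer_s, overlap):
    """Time until the last layer's KV block has arrived at the decode pool."""
    if not overlap:
        # Finish all work for a layer, then send it, one layer at a time.
        return layers * (compute_per_layer_s + transfer_per_layer_s)
    # Pipelined: layer i is sent while layer i + 1 is being computed,
    # so the slower of the two stages sets the pace.
    slowest = max(compute_per_layer_s, transfer_per_layer_s)
    return compute_per_layer_s + (layers - 1) * slowest + transfer_per_layer_s


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 32k-token prompt.
fp16_size = kv_cache_bytes(32768, 32, 8, 128)                      # 4 GiB
int8_size = kv_cache_bytes(32768, 32, 8, 128, bytes_per_value=1)   # 2 GiB

# Hypothetical timings: 10 ms compute and 4 ms transfer per layer.
serial = handoff_seconds(32, 0.010, 0.004, overlap=False)
pipelined = handoff_seconds(32, 0.010, 0.004, overlap=True)
```

With these numbers the transfer is almost entirely hidden behind compute when overlapped, and only the final layer's transfer remains exposed.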
In practice
- Overlap KV cache transfer with compute.
- Use fast interconnects like NVLink.
- Compress KV cache to INT8 for transfer.
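The INT8 compression step can be sketched as follows. The article does not say which quantization scheme is used, so this example assumes simple symmetric absmax quantization with one scale per vector; the function names are illustrative, and a production system would operate on GPU tensors rather than Python lists.

```python
def quantize_int8(values):
    """Map non-empty floats to integers in [-127, 127] plus one float scale."""
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats on the decode side."""
    return [q * scale for q in quantized]


# One hypothetical slice of a key or value vector.
kv_slice = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(kv_slice)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(kv_slice, restored))
```

Each value shrinks from two bytes (FP16) to one, at the price of a rounding error of at most half a quantization step per element. Whether that error is acceptable depends on the model and should be measured.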
Topics
- LLM Inference Optimization
- Prefill/Decode Disaggregation
- GPU Parallelism
- KV Cache Management
- Time To First Token
- Time Per Output Token
Best for: MLOps Engineer, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.