Prefill/Decode Disaggregation: Why Your GPU Can’t Do Two Things at Once

· Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

LLM inference has a systems problem: its two phases, prefill (processing the input tokens) and decode (generating output one token at a time), stress hardware in opposite ways. Prefill is compute-bound and benefits from tensor parallelism, while decode is memory-bound and suffers from communication overhead. Because of this conflict, tuning shared GPUs for one phase compromises the other and causes problems such as head-of-line blocking. Prefill/decode disaggregation resolves this by running each phase on its own specialized GPU cluster. Transferring the KV cache between clusters has a cost, but it can be mitigated by overlapping transfers, using fast interconnects such as NVLink, and compressing the KV cache to INT8. The approach was introduced by the 2023 Splitwise paper and was adopted rapidly during 2024 in systems such as SGLang, vLLM, and Mooncake. It has become essential as context windows expanded from 4k to 1M tokens.
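To get a feel for the KV cache transfer cost mentioned above, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128) and the 50 GB/s link bandwidth are illustrative assumptions, not figures from the article, and the INT8 estimate ignores the small overhead of storing quantization scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem):
    """Size of the KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_seconds(num_bytes, link_bytes_per_s):
    """Idealized transfer time: payload size divided by link bandwidth."""
    return num_bytes / link_bytes_per_s


# Hypothetical model shape and link speed (assumptions for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
LINK_BYTES_PER_S = 50e9

for tokens in (4_096, 1_000_000):
    fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, bytes_per_elem=2)
    int8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, bytes_per_elem=1)
    print(
        f"{tokens:>9} tokens: "
        f"FP16 {fp16 / 2**20:10.1f} MiB ({transfer_seconds(fp16, LINK_BYTES_PER_S):.4f} s), "
        f"INT8 {int8 / 2**20:10.1f} MiB ({transfer_seconds(int8, LINK_BYTES_PER_S):.4f} s)"
    )
```

Under these assumptions the cache grows linearly with context length, which is why the transfer cost matters far more at 1M tokens than at 4k, and why halving the element size with INT8 halves the payload.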

Key takeaway

AI engineers optimizing LLM inference should recognize that the prefill and decode phases demand distinct hardware strategies. If you see unpredictable latency or sluggish responses, consider prefill/decode disaggregation. By dedicating a specialized GPU pool to each phase, it resolves the conflict between compute-bound prefill and memory-bound decode and significantly improves both Time To First Token (TTFT) and Time Per Output Token (TPOT). KV cache transfer adds overhead, but with strategic mitigation the performance gains are worthwhile for large context windows.

Key insights

LLM inference's prefill and decode phases have conflicting hardware demands, necessitating disaggregation for optimal performance and user experience.
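One way to see why the two phases conflict is to compare their arithmetic intensity (FLOPs per byte moved) for a single weight-matrix multiply. The sketch below is a simplified model under stated assumptions: FP16 weights and activations, a square 4096 x 4096 layer, and a 2048-token prompt. These numbers are hypothetical and not taken from the article.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for multiplying num_tokens activations by a d_in x d_out weight.

    FLOPs: one multiply and one add per weight per token.
    Bytes: the weights are read once, plus input and output activations.
    """
    flops = 2 * num_tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + num_tokens * (d_in + d_out))
    return flops / bytes_moved


# Prefill pushes the whole prompt through the weights in one pass.
prefill = arithmetic_intensity(num_tokens=2048, d_in=4096, d_out=4096)

# Decode pushes a single new token through the same weights.
decode = arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.4f} FLOPs/byte")
```

In this model prefill performs roughly a thousand times more work per byte fetched than single-sequence decode, so prefill tends to be limited by compute while decode tends to be limited by memory bandwidth. Batching many decode requests raises the decode figure, which fits the summary's point that decode hardware is tuned for concurrency.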

Method

Disaggregation uses separate GPU clusters for prefill (optimized for compute and tensor parallelism) and decode (optimized for memory and concurrency). The KV cache is transferred between the clusters, and the transfer cost is mitigated with strategies such as overlapping and compression.
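The effect of overlapping can be illustrated with a small timing model. This is a sketch under assumptions the summary does not spell out: the prefill worker computes the KV cache layer by layer, each layer's cache can be sent as soon as it is ready, and a single link handles one transfer at a time. The function names and the per-layer timings are hypothetical.

```python
def serial_time(compute_per_layer, transfer_per_layer, num_layers):
    """No overlap: each layer is computed, then transferred, before the next starts."""
    return num_layers * (compute_per_layer + transfer_per_layer)


def overlapped_time(compute_per_layer, transfer_per_layer, num_layers):
    """Overlap: a layer's KV cache is sent while later layers are still computing.

    A transfer starts once its layer is computed and the link is free.
    Returns the time at which the last transfer finishes.
    """
    compute_done = 0.0  # time at which the current layer finishes computing
    link_free = 0.0     # time at which the link finishes its current transfer
    for _ in range(num_layers):
        compute_done += compute_per_layer
        start = max(compute_done, link_free)
        link_free = start + transfer_per_layer
    return link_free


# Hypothetical timings in milliseconds per layer.
print(serial_time(2.0, 1.0, num_layers=32))      # compute and transfer back to back
print(overlapped_time(2.0, 1.0, num_layers=32))  # transfers hidden behind compute
```

When the per-layer transfer is shorter than the per-layer compute, only the final layer's transfer is exposed. When the transfer is longer, the link becomes the bottleneck, which is where faster interconnects and compression help.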

Best for: MLOps Engineer, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.