Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn’t Do Both.
Summary
Disaggregated inference is an architecture that splits Large Language Model (LLM) serving into its two phases, prefill and decode, and runs each on a separate hardware pool sized for that phase. It addresses the utilization mismatch seen in monolithic serving, where the same GPUs are well used in one phase and underused in the other. For instance, the article reports that an H100 GPU can hit 92% utilization during compute-bound prefill but drop to 28-30% during memory-bound decode. Separating the phases allows independent scaling and hardware right-sizing, with reported infrastructure cost reductions of 15-40% and throughput gains of 2x to 6.4x. Key components are a KV-aware router, a prefill pool for compute-intensive prompt processing, and a decode pool for memory-intensive token generation, with the KV-cache transferred between the pools, often via RDMA.
Key takeaway
For AI Architects and MLOps Engineers scaling LLM inference, disaggregated serving offers substantial cost savings and latency control by optimizing hardware utilization. You should evaluate your workload's prefill-to-decode ratio, KV-cache size, prefix cache hit rate, GPU count (ideally more than 16), and network capabilities (RDMA, >100 Gbps). If these factors are favorable, adopting disaggregation, starting with vLLM's native support, can significantly reduce per-token serving costs and improve inter-token latency.
Key insights
Disaggregated inference optimizes LLM serving costs and latency by separating compute-bound prefill from memory-bound decode.
Principles
- LLM inference has two distinct phases with opposite hardware needs.
- Monolithic serving leads to significant GPU underutilization and cost waste.
- Independent scaling of prefill and decode pools improves efficiency.
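The "opposite hardware needs" in the first principle can be illustrated with a rough arithmetic-intensity estimate for a single dense layer. This is a minimal sketch, not taken from the original article: the function names, the square-weight-matrix simplification, and the round-number hardware figures are all assumptions chosen for illustration.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one d_model x d_model linear layer.

    Simplifying assumptions: weights are read once per pass, and input and
    output activations are each moved once.
    """
    flops = 2 * n_tokens * d_model * d_model           # multiply-accumulate
    weight_bytes = bytes_per_elem * d_model * d_model  # weights read once
    act_bytes = bytes_per_elem * 2 * n_tokens * d_model  # inputs + outputs
    return flops / (weight_bytes + act_bytes)


def is_compute_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """Compare against the roofline ridge point (peak FLOP/s / bytes/s)."""
    return intensity >= peak_flops / mem_bandwidth


# Illustrative round numbers, NOT the specification of any particular GPU.
PEAK_FLOPS = 1.0e15      # FLOP/s (assumed)
MEM_BANDWIDTH = 3.0e12   # bytes/s (assumed)

prefill = arithmetic_intensity(n_tokens=2048, d_model=4096)  # whole prompt at once
decode = arithmetic_intensity(n_tokens=1, d_model=4096)      # one token per step

print(f"prefill: {prefill:.1f} FLOP/byte, compute-bound: "
      f"{is_compute_bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.4f} FLOP/byte, compute-bound: "
      f"{is_compute_bound(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
```

Under these assumptions, prefill amortizes each weight read across thousands of prompt tokens, while an unbatched decode step reads the same weights to produce a single token, which is why the two phases land on opposite sides of the ridge point.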
Method
A disaggregated deployment routes each request to the prefill pool, transfers the resulting KV-cache to the decode pool over a fast network, and then generates output tokens autoregressively on the decode pool. This requires a KV-aware router and specialized hardware pools.
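The flow described above can be sketched as a toy simulation. Every class and function name here (`KVCache`, `PrefillWorker`, `DecodeWorker`, `serve`) is hypothetical and invented for illustration; this is not vLLM's API, and the routing policy (least-queued prefill worker, decode worker with the most free KV capacity) is just one simple assumption about what a router might do.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KVCache:
    request_id: str
    num_tokens: int  # stand-in for the real key/value tensors


@dataclass
class PrefillWorker:
    name: str
    queued_tokens: int = 0

    def prefill(self, request_id: str, prompt_tokens: List[int]) -> KVCache:
        # A real worker runs one forward pass over the whole prompt here.
        return KVCache(request_id, len(prompt_tokens))


@dataclass
class DecodeWorker:
    name: str
    free_kv_slots: int
    caches: Dict[str, KVCache] = field(default_factory=dict)

    def receive(self, cache: KVCache) -> None:
        # Stand-in for the KV-cache transfer (for example, over RDMA).
        self.free_kv_slots -= cache.num_tokens
        self.caches[cache.request_id] = cache

    def decode(self, request_id: str, max_new_tokens: int) -> List[int]:
        cache = self.caches[request_id]
        generated = []
        for step in range(max_new_tokens):
            generated.append(step)   # placeholder token id
            cache.num_tokens += 1    # each new token extends the cache
            self.free_kv_slots -= 1
        return generated


def serve(request_id: str, prompt_tokens: List[int], max_new_tokens: int,
          prefill_pool: List[PrefillWorker],
          decode_pool: List[DecodeWorker]) -> Tuple[str, List[int]]:
    """Route, prefill, hand off the KV-cache, then decode."""
    p = min(prefill_pool, key=lambda w: w.queued_tokens)
    d = max(decode_pool, key=lambda w: w.free_kv_slots)
    cache = p.prefill(request_id, prompt_tokens)
    d.receive(cache)
    return d.name, d.decode(request_id, max_new_tokens)


prefill_pool = [PrefillWorker("p0", queued_tokens=100), PrefillWorker("p1", queued_tokens=10)]
decode_pool = [DecodeWorker("d0", free_kv_slots=1000), DecodeWorker("d1", free_kv_slots=5000)]
worker_name, tokens = serve("r1", [1, 2, 3], 4, prefill_pool, decode_pool)
print(worker_name, tokens)
```

The point of the sketch is the handoff: the prefill worker never generates output tokens, and the decode worker never sees the prompt, only the cache built from it.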
In practice
- Use vLLM's built-in disaggregated prefilling mode.
- Measure prefill-to-decode time ratio for workload assessment.
- Audit network for RDMA capability and 100 Gbps links.
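For the measurement and network items above, a back-of-the-envelope calculation can help before any deployment work. The sketch below is an assumption-laden estimate rather than the article's method: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical configuration, the timings are made-up sample values, and the transfer estimate assumes the link runs at its full nominal rate, which real networks generally do not reach.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of one request's KV-cache: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time at the nominal link rate (no protocol overhead)."""
    return num_bytes * 8 / (link_gbps * 1e9)


def prefill_to_decode_ratio(prefill_seconds: float, decode_seconds: float) -> float:
    """Ratio of time spent in prefill to time spent in decode for a request."""
    return prefill_seconds / decode_seconds


# Hypothetical model shape and prompt length, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
seconds = transfer_seconds(size, link_gbps=100)

print(f"KV-cache size: {size / 2**20:.0f} MiB")
print(f"Ideal transfer time at 100 Gbps: {seconds * 1e3:.1f} ms")

# Made-up sample timings; substitute measurements from your own workload.
print(f"Prefill-to-decode ratio: {prefill_to_decode_ratio(0.4, 3.2):.3f}")
```

With these assumed numbers the cache is 512 MiB and takes roughly 43 ms to move at the nominal rate, which is the kind of figure to weigh against the time-to-first-token budget when judging whether the network can support the handoff.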
Topics
- Disaggregated Inference
- LLM Serving Optimization
- GPU Resource Management
- KV-Cache Transfer
- Real-time LLM Inference
Best for: MLOps Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.