NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference
Summary
NetKV is a system for disaggregated Large Language Model (LLM) inference that selects decode instances with network topology and dynamic congestion in mind. In a disaggregated setup, the KV cache must traverse the datacenter network, so transfer time adds directly to Time to First Token (TTFT). Existing schedulers focus primarily on compute load and prefix-cache locality and overlook network transfer time. NetKV introduces a thin operator-to-scheduler interface and a network cost oracle, and the authors prove that ignoring the network term leads to suboptimal scheduling as context length increases. Decode instances are chosen by an O(|D|) per-request greedy algorithm that is shown to be robust to stale telemetry. Evaluated on a 64-GPU four-tier fat-tree simulator using Mooncake traces, NetKV reduced mean TTFT by up to 21.2% compared to round-robin and 17.6% against a tuned cache+load-aware scheduler. It also raised SLO attainment by up to 20.1 percentage points while keeping Time Between Tokens overhead below 0.5 ms, all without hardware or engine modifications.
Key takeaway
AI architects and ML engineers optimizing disaggregated LLM inference should prioritize network-aware scheduling to improve Time to First Token (TTFT). Ignoring network transfer time when choosing where the KV cache is sent can create substantial bottlenecks, especially with longer contexts. A network cost oracle like NetKV's factors topological distance and dynamic congestion into the decision. In the reported simulation, this reduced mean TTFT by up to 21.2% and raised Service Level Objective (SLO) attainment by up to 20.1 percentage points without requiring hardware changes.
Key insights
Network-aware scheduling for disaggregated LLM inference significantly reduces TTFT by optimizing KV cache placement.
Principles
- Network topology impacts LLM inference performance.
- Cache-aware scheduling alone is suboptimal for long contexts.
- Robustness to stale telemetry is achievable.
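A back-of-the-envelope calculation helps show why the network term grows with context length. The sketch below uses the standard KV cache size formula for a transformer (keys plus values, per layer, per KV head) and divides by link bandwidth. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the bandwidth figures are illustrative assumptions, not values from the NetKV paper.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one request.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def transfer_seconds(num_bytes: int, bandwidth_gbps: float) -> float:
    """Idealized transfer time over a link with the given bandwidth in Gbit/s.

    Ignores protocol overhead and latency; only the bandwidth term is modeled.
    """
    return num_bytes * 8 / (bandwidth_gbps * 1e9)


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (1_024, 8_192, 65_536):
    size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    fast = transfer_seconds(size, 100)  # assumed uncongested path
    slow = transfer_seconds(size, 25)   # assumed congested or distant path
    print(f"{tokens:>6} tokens: {size / 2**20:8.0f} MiB, "
          f"{fast * 1e3:7.1f} ms @100G, {slow * 1e3:7.1f} ms @25G")
```

Under these assumed numbers, an 8,192-token context carries 1 GiB of KV cache, so the transfer takes roughly 86 ms at 100 Gbit/s and four times as long at an effective 25 Gbit/s. Because the cost scales linearly with tokens, the gap between a good and a bad path widens as contexts get longer.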
Method
NetKV employs a network cost oracle and an O(|D|) per-request greedy algorithm for decode instance selection, integrating network topology and congestion awareness into scheduling decisions.
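The selection step can be sketched as a single pass over the candidates, which is where an O(|D|) per-request cost would come from if D is read as the set of candidate decode instances (an assumption; the summary does not define D). Everything in the sketch is hypothetical: the `DecodeInstance` fields, the additive cost model (queue delay plus transfer time), and the bandwidth-table oracle are stand-ins for illustration, not NetKV's actual interface or objective.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class DecodeInstance:
    name: str
    queue_delay_s: float       # assumed estimate of wait time from compute load
    cached_prefix_tokens: int  # prefix tokens already resident on this instance


# Oracle signature (hypothetical): (source node, destination node, bytes) -> seconds
NetworkCost = Callable[[str, str, int], float]


def make_bandwidth_oracle(bytes_per_s: Dict[Tuple[str, str], float]) -> NetworkCost:
    """Toy oracle: transfer time = bytes / effective path bandwidth."""
    def cost(src: str, dst: str, num_bytes: int) -> float:
        return num_bytes / bytes_per_s[(src, dst)]
    return cost


def select_decode_instance(prefill_node: str, kv_bytes_per_token: int,
                           context_tokens: int,
                           candidates: Sequence[DecodeInstance],
                           network_cost_s: NetworkCost):
    """Greedy pick: one cost evaluation per candidate, keep the minimum."""
    if not candidates:
        raise ValueError("no decode instances available")
    best, best_cost = None, float("inf")
    for inst in candidates:
        # Only tokens missing from the instance's prefix cache need to move.
        tokens_to_send = max(0, context_tokens - inst.cached_prefix_tokens)
        transfer = network_cost_s(prefill_node, inst.name,
                                  tokens_to_send * kv_bytes_per_token)
        total = inst.queue_delay_s + transfer
        if total < best_cost:
            best, best_cost = inst, total
    return best, best_cost


# Illustrative setup: d0 is less loaded but sits behind a slower path.
candidates = [DecodeInstance("d0", 0.010, 0), DecodeInstance("d1", 0.030, 0)]
oracle = make_bandwidth_oracle({("p0", "d0"): 1e9, ("p0", "d1"): 10e9})

short_pick, _ = select_decode_instance("p0", 131072, 16, candidates, oracle)
long_pick, _ = select_decode_instance("p0", 131072, 8192, candidates, oracle)
print(short_pick.name, long_pick.name)
```

In this toy setup the less loaded instance wins for a 16-token context, but the instance on the faster path wins at 8,192 tokens, because the transfer term then dominates the queue delay. A load-only scheduler would pick `d0` in both cases.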
In practice
- Integrate network cost into LLM inference schedulers.
- Prioritize TTFT optimization in disaggregated setups.
- Design schedulers robust to network telemetry delays.
Topics
- Disaggregated LLM Inference
- KV Cache Optimization
- Network-Aware Scheduling
- Time to First Token
- Datacenter Networks
- LLM Performance
Best for: MLOps Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.