NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

NetKV is a system for disaggregated Large Language Model (LLM) inference that selects decode instances with network topology and dynamic congestion in mind. In a disaggregated deployment, the KV cache must traverse the datacenter network, so transfer time feeds directly into Time to First Token (TTFT). Existing schedulers focus mainly on compute load and prefix-cache locality and overlook this transfer time. NetKV adds a thin operator-to-scheduler interface and a network cost oracle, and the authors prove that ignoring the network term leads to suboptimal scheduling as context length increases. Selection uses a greedy algorithm that costs O(|D|) per request and is robust to stale telemetry. Evaluated on a 64-GPU four-tier fat-tree simulator using Mooncake traces, NetKV reduced mean TTFT by up to 21.2% compared to round-robin and by up to 17.6% compared to a tuned cache+load-aware scheduler. It also raised SLO attainment by up to 20.1 percentage points while keeping Time Between Tokens overhead below 0.5 ms, all without hardware or engine modifications.

Key takeaway

AI architects and ML engineers optimizing disaggregated LLM inference should prioritize network-aware scheduling to improve Time to First Token (TTFT). Ignoring network latency when placing the KV cache can create substantial bottlenecks, especially with longer contexts. A network cost oracle like NetKV's factors in topological distance and dynamic congestion. In the reported simulation, this cut mean TTFT by up to 21.2% and raised Service Level Objective (SLO) attainment by up to 20.1 percentage points, without requiring hardware changes.
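To see why longer contexts make the network term harder to ignore, a rough back-of-the-envelope sketch helps. The KV cache size formula below is the standard one for transformer attention (one key and one value tensor per layer). The model dimensions and link speeds in the usage note are illustrative assumptions, not figures from the NetKV paper, and the function names are hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one request.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes fp16/bf16 storage (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(size_bytes, bandwidth_gbps):
    """Idealized transfer time over a link with the given usable bandwidth.

    Ignores protocol overhead and propagation delay; gigabits are decimal (1e9).
    """
    return size_bytes * 8 / (bandwidth_gbps * 1e9)


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
    for tokens in (1024, 8192, 65536):
        size = kv_cache_bytes(tokens, num_layers=32, num_kv_heads=8, head_dim=128)
        fast = transfer_seconds(size, bandwidth_gbps=100)  # uncongested link (assumed)
        slow = transfer_seconds(size, bandwidth_gbps=25)   # congested link (assumed)
        print(f"{tokens:>6} tokens: {size / 2**20:8.0f} MiB, "
              f"{fast * 1e3:7.1f} ms at 100 Gb/s, {slow * 1e3:7.1f} ms at 25 Gb/s")
```

Because the cache size is linear in the number of tokens, the transfer time is too, so the gap between a good and a poor network path grows in proportion to context length.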

Key insights

Network-aware scheduling for disaggregated LLM inference significantly reduces TTFT by optimizing KV cache placement.

Principles

Method

NetKV employs a network cost oracle and an O(|D|) per-request greedy algorithm for decode instance selection, integrating network topology and congestion awareness into scheduling decisions.
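The general shape of such a selection step can be sketched as a single linear pass over the candidates. This is a minimal illustration, not NetKV's actual implementation: it assumes D denotes the set of candidate decode instances, and the cost model, field names, default link speed, per-hop latency, and the 5% bandwidth floor are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class DecodeInstance:
    name: str
    queue_delay_s: float       # estimated wait due to compute load
    cached_prefix_tokens: int  # prompt tokens already cached on this instance
    hops: int                  # topological distance from the prefill instance
    link_utilization: float    # 0.0 (idle) to 1.0 (saturated), from telemetry


def network_cost_s(inst, bytes_to_send, link_gbps=100.0, per_hop_latency_s=5e-6):
    """Hypothetical network cost oracle: hop latency plus congested transfer time."""
    # Floor the free share at 5% so a saturated reading never divides by zero.
    free_share = max(1.0 - inst.link_utilization, 0.05)
    available_bps = link_gbps * 1e9 * free_share
    return inst.hops * per_hop_latency_s + bytes_to_send * 8 / available_bps


def select_decode_instance(instances, prompt_tokens, bytes_per_token):
    """Greedy pick: one pass over the candidates, so O(|D|) per request."""
    best, best_cost = None, float("inf")
    for inst in instances:
        # Only the part of the KV cache not already present must be transferred.
        missing_tokens = max(prompt_tokens - inst.cached_prefix_tokens, 0)
        cost = inst.queue_delay_s + network_cost_s(inst, missing_tokens * bytes_per_token)
        if cost < best_cost:
            best, best_cost = inst, cost
    return best, best_cost


if __name__ == "__main__":
    candidates = [
        DecodeInstance("near-busy", queue_delay_s=0.02, cached_prefix_tokens=0,
                       hops=2, link_utilization=0.1),
        DecodeInstance("far-idle", queue_delay_s=0.0, cached_prefix_tokens=0,
                       hops=6, link_utilization=0.8),
    ]
    for tokens in (16, 8192):
        chosen, cost = select_decode_instance(candidates, tokens, bytes_per_token=131072)
        print(tokens, chosen.name, f"{cost * 1e3:.2f} ms")
```

In this toy setup the idle but distant instance wins for a very short prompt, while the closer instance on a less congested path wins once the prompt is long. A scheduler that looked only at compute load would choose the distant instance in both cases.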

In practice

Topics

Best for: MLOps Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.