Running AI on mixed hardware for speed and affordability
Summary
The open-source orchestrator llm-d addresses the challenges of deploying large language models (LLMs) on-premises, particularly on mixed GPU hardware, with the goal of improving performance and controlling costs. Developed by the open-source community and further optimized by IBM Research, Red Hat, and NxtGen Cloud Technologies, llm-d uses a cache-aware router to distribute inference requests. The router tracks the key-value (KV) cache state of vLLM instances in real time and directs each incoming request to an instance likely to hold the relevant pre-computed data. In experiments on the NxtGen sovereign cloud, llm-d ran IBM Granite and Sarvam AI models 3 to 5 times faster and could serve potentially twice as many users compared to traditional Kubernetes setups. Under heavy traffic with heterogeneous pods it reached 14,200 tokens per second, nearly double the 7,500 tokens per second of the Kubernetes setup, and it could save up to \$5.25 million annually for a Sarvam-30B model serving 1,000 users.
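As a quick check on the figures above, the two reported throughputs can be compared directly. The sketch below uses only the numbers quoted in the summary; reading the ratio as user capacity assumes that the number of users served scales roughly linearly with tokens per second, which is a simplification not stated in the source.

```python
# Throughput figures quoted in the summary (tokens per second, heavy traffic).
LLMD_TPS = 14_200
KUBERNETES_TPS = 7_500

# Ratio of llm-d throughput to the Kubernetes baseline.
speedup = LLMD_TPS / KUBERNETES_TPS

print(f"Throughput ratio: {speedup:.2f}x")
```

The ratio comes out to roughly 1.89, which is consistent with the claim of serving potentially twice as many users.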
Key takeaway
For MLOps engineers deploying LLMs on-premises, llm-d offers a way to improve performance while controlling infrastructure costs. Its cache-aware routing across mixed GPU clusters can raise throughput and cut latency, with reported results of 3 to 5 times faster inference and potentially twice as many users served. This lets you use existing, diverse hardware more effectively, avoid expensive upgrades, and achieve substantial annual savings, such as up to \$5.25 million for a Sarvam-30B model.
Key insights
llm-d speeds up LLM inference on mixed GPU clusters by routing each request according to KV cache state, which raises throughput and lowers costs.
Principles
- Cache-aware routing improves LLM inference efficiency.
- Decoupling prefill and decoding optimizes hardware use.
- Heterogeneous hardware can be unified for cost savings.
Method
llm-d uses a cache-aware router to send each incoming LLM inference request to a vLLM instance whose KV cache already holds a matching prompt prefix. It also separates the prefill and decoding steps so that each can run on hardware suited to it.
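To make the routing idea concrete, here is a minimal sketch of prefix-aware request routing. It is not llm-d's implementation or API: the `Instance` class, the `route` function, the block size, and the tie-breaking by active request count are all hypothetical choices made for illustration. The sketch hashes the prompt in fixed-size token blocks, chaining each hash to the previous one so that a hash identifies an entire prefix, and then picks the instance with the longest run of matching blocks.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block (illustrative value, not from the source)


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so every hash identifies a whole prefix."""
    hashes = []
    prev = b""
    full_len = len(tokens) - len(tokens) % block_size
    for i in range(0, full_len, block_size):
        chunk = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


@dataclass
class Instance:
    """Hypothetical stand-in for a serving instance as seen by the router."""
    name: str
    cached: set = field(default_factory=set)  # prefix-block hashes believed cached
    active_requests: int = 0


def matched_blocks(instance, hashes):
    """Count leading blocks of the prompt that the instance already holds."""
    count = 0
    for h in hashes:
        if h not in instance.cached:
            break
        count += 1
    return count


def route(instances, tokens):
    """Prefer the longest cached prefix; break ties by lowest current load."""
    hashes = block_hashes(tokens)
    best = max(
        instances,
        key=lambda inst: (matched_blocks(inst, hashes), -inst.active_requests),
    )
    # Record that the chosen instance will now hold these prefix blocks.
    best.cached.update(hashes)
    best.active_requests += 1
    return best
```

For example, with two empty instances, a first request lands on either one, a follow-up request that shares its prefix goes to the same instance, and an unrelated prompt goes to the less loaded one. A production router would also need cache eviction tracking and request completion handling, which this sketch leaves out.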
In practice
- Deploy llm-d for multi-vendor GPU LLM serving.
- Utilize older GPUs for lower-priority LLM tasks.
- Optimize prefill/decoding on distinct hardware pools.
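The practices above can be illustrated with a small pool-selection sketch. Everything here is hypothetical: the pool names, the two priority levels, and the `pick_pool` function are assumptions for illustration and do not reflect llm-d's configuration or API. The idea shown is simply that high-priority work is split by phase across dedicated pools, while low-priority work is sent to older GPUs.

```python
from enum import Enum


class Phase(Enum):
    PREFILL = "prefill"  # processing the prompt
    DECODE = "decode"    # generating output tokens


# Hypothetical pool names; a real deployment would define its own.
PHASE_POOLS = {
    Phase.PREFILL: "newer-gpu-prefill-pool",
    Phase.DECODE: "newer-gpu-decode-pool",
}
LOW_PRIORITY_POOL = "older-gpu-pool"


def pick_pool(phase: Phase, priority: str) -> str:
    """Send low-priority work to older GPUs; split the rest by phase."""
    if priority == "low":
        return LOW_PRIORITY_POOL
    return PHASE_POOLS[phase]
```

Calling `pick_pool(Phase.PREFILL, "high")` returns the prefill pool name, while any call with `priority="low"` returns the older-GPU pool regardless of phase.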
Topics
- LLM Inference
- GPU Orchestration
- Heterogeneous Computing
- KV Cache Optimization
- On-premises AI
- vLLM
Best for: CTO, Director of AI/ML, VP of Engineering/Data, AI Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by IBM Research.