Running AI on mixed hardware for speed and affordability

2026-06-23 · Source: IBM Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, short

Summary

The open-source orchestrator llm-d targets a common on-premises problem: serving large language models (LLMs) across mixed GPU hardware while keeping performance high and costs under control. Developed by the open-source community and further optimized by IBM Research, Red Hat, and NxtGen Cloud Technologies, llm-d uses a cache-aware router to distribute inference requests. The router tracks the key-value (KV) cache state of vLLM instances in real time and directs each incoming request to an instance likely to hold the pre-computed data it needs. In experiments on the NxtGen sovereign cloud, llm-d ran IBM Granite and Sarvam AI models 3 to 5 times faster than traditional Kubernetes setups and could potentially serve twice as many users. Under heavy traffic with heterogeneous pods it reached 14,200 tokens per second, compared with 7,500 tokens per second for Kubernetes, and it could save up to $5.25 million annually for a Sarvam-30B model serving 1,000 users.
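
As a quick check on how the throughput figures relate to the "twice as many users" claim, the sketch below computes the ratio of the two quoted numbers. It assumes, for illustration only, that the number of users served scales linearly with tokens per second; the summary does not state how the user estimate was derived.

```python
# Throughput figures quoted in the summary (tokens per second, heavy traffic).
LLM_D_TOKENS_PER_SEC = 14_200
KUBERNETES_TOKENS_PER_SEC = 7_500


def throughput_ratio(new: float, baseline: float) -> float:
    """Return how many times higher `new` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return new / baseline


ratio = throughput_ratio(LLM_D_TOKENS_PER_SEC, KUBERNETES_TOKENS_PER_SEC)

# Assumption: users served scale linearly with throughput at a fixed
# per-user token rate, so this ratio approximates the capacity gain.
print(f"{ratio:.2f}x")  # prints 1.89x
```

A ratio of about 1.89 is in line with the "potentially twice as many users" estimate.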

Key takeaway

For MLOps engineers deploying LLMs on-premises, llm-d offers a way to raise performance while controlling infrastructure costs. Its cache-aware routing across mixed GPU clusters can boost throughput and cut latency, potentially serving twice as many users 3 to 5 times faster. That lets you get more out of the diverse hardware you already own, avoid expensive upgrades, and achieve substantial annual savings, such as up to $5.25 million for a Sarvam-30B model.

Key insights

llm-d speeds up LLM inference on mixed GPU clusters by routing each request according to the KV cache state of the available instances, which raises throughput and lowers serving costs.

Principles

Cache-aware routing improves LLM inference efficiency.
Decoupling prefill and decoding optimizes hardware use.
Heterogeneous hardware can be unified for cost savings.

Method

llm-d employs a cache-aware router that directs incoming LLM inference requests to vLLM instances already holding matching prompt prefixes in their KV cache. It also separates the prefill and decoding steps so that each can run on hardware suited to it.
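
The routing idea can be illustrated with a minimal sketch. This is not llm-d's implementation or API: the `Instance` class, the `route` function, the block size, and the tie-break on in-flight requests are all hypothetical, chosen only to show how a router might prefer the instance whose cache already covers the longest prefix of a prompt.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block; hypothetical, kept small for the demo


def prefix_block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[int]:
    """Hash each full block chained with the previous hash, so every hash
    identifies the whole prompt prefix up to and including that block."""
    hashes = []
    running = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tuple(tokens[start:start + block_size])
        running = hash((running, block))
        hashes.append(running)
    return hashes


@dataclass
class Instance:
    """Hypothetical view of one serving instance as seen by the router."""
    name: str
    cached: set[int] = field(default_factory=set)  # prefix hashes believed cached
    in_flight: int = 0  # requests routed here (never decremented in this sketch)


def matched_blocks(instance: Instance, hashes: list[int]) -> int:
    """Count leading blocks of the prompt that the instance already caches."""
    count = 0
    for h in hashes:
        if h not in instance.cached:
            break
        count += 1
    return count


def route(tokens: list[int], instances: list[Instance]) -> Instance:
    """Pick the instance with the longest cached prefix; break ties by load."""
    hashes = prefix_block_hashes(tokens)
    best = max(instances, key=lambda i: (matched_blocks(i, hashes), -i.in_flight))
    best.cached.update(hashes)  # the chosen instance will now hold this prefix
    best.in_flight += 1
    return best


pods = [Instance("pod-a"), Instance("pod-b")]
system_prompt = list(range(8))  # stand-in token IDs for a shared system prompt

first = route(system_prompt + [100, 101, 102, 103], pods)
second = route(system_prompt + [200, 201, 202, 203], pods)  # shares the prefix
third = route([900, 901, 902, 903], pods)  # unrelated prompt
print(first.name, second.name, third.name)
```

In this toy run the second request follows the first to the same pod because they share a system prompt, while the unrelated third request falls back to the less loaded pod. A production router would also need to account for cache eviction and request completion, which the sketch leaves out.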

In practice

Deploy llm-d for multi-vendor GPU LLM serving.
Utilize older GPUs for lower-priority LLM tasks.
Optimize prefill/decoding on distinct hardware pools.
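
One way to picture the last two points together is a small placement rule that picks a hardware pool from the inference phase and the request priority. The pool names, tiers, and the `pick_pool` helper below are hypothetical and are not llm-d configuration; they only sketch the idea of keeping prefill and decoding on separate pools while sending lower-priority work to older GPUs.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PREFILL = "prefill"  # processes the whole prompt at once
    DECODE = "decode"    # generates output tokens one step at a time


@dataclass(frozen=True)
class Pool:
    """Hypothetical group of GPUs dedicated to one phase."""
    name: str
    phase: Phase
    tier: str  # "current" or "older" hardware generation


# Illustrative inventory; a real cluster would discover this dynamically.
POOLS = [
    Pool("prefill-current", Phase.PREFILL, "current"),
    Pool("decode-current", Phase.DECODE, "current"),
    Pool("prefill-older", Phase.PREFILL, "older"),
    Pool("decode-older", Phase.DECODE, "older"),
]


def pick_pool(phase: Phase, low_priority: bool, pools: list[Pool] = POOLS) -> Pool:
    """Send low-priority work to older GPUs, keeping each phase on its own pool."""
    tier = "older" if low_priority else "current"
    for pool in pools:
        if pool.phase is phase and pool.tier == tier:
            return pool
    raise LookupError(f"no {tier} pool available for {phase.value}")


print(pick_pool(Phase.PREFILL, low_priority=False).name)
print(pick_pool(Phase.DECODE, low_priority=True).name)
```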

Topics

LLM Inference
GPU Orchestration
Heterogeneous Computing
KV Cache Optimization
On-premises AI
vLLM

Best for: CTO, Director of AI/ML, VP of Engineering/Data, AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by IBM Research.