How to serve Kimi-K2-Instruct on Lambda with vLLM
Summary
This article walks through deploying Moonshot AI's Kimi-K2-Instruct, a one-trillion-parameter Mixture-of-Experts (MoE) language model, on Lambda's 8x NVIDIA Blackwell GPU instances, using vLLM for multi-GPU inference. Kimi-K2-Instruct is licensed under MIT and has a 128K context window. It requires 959 GB of disk space and 1,347 GB of VRAM at idle, which makes single-GPU operation impractical. The deployment has three steps: launch an 8x NVIDIA Blackwell GPU instance, start a vLLM server with parameters such as `--tensor-parallel-size 8`, and benchmark the server with `vllm bench serve`. The reported results are an output generation throughput of 219.644 ± 0.497 tokens per second, a total throughput of 332.358 ± 0.810 tokens per second, and a mean time to first token of 148.986 ± 6.619 ms.
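To put the memory and throughput figures in perspective, the short calculation below splits the reported VRAM across the eight GPUs and derives the input-token rate implied by the two throughput numbers. It is a rough sketch under two assumptions: that memory is spread evenly across GPUs (tensor parallelism does not divide every buffer exactly), and that total throughput counts input and output tokens together. The variable names are illustrative.

```python
# Figures quoted in the summary.
IDLE_VRAM_GB = 1347        # VRAM required at idle
NUM_GPUS = 8               # GPUs in the instance
OUTPUT_TOK_S = 219.644     # mean output generation throughput (tokens/s)
TOTAL_TOK_S = 332.358      # mean total throughput (tokens/s)

# Assumption: the model's memory is spread evenly across all GPUs.
per_gpu_gb = IDLE_VRAM_GB / NUM_GPUS

# Assumption: total throughput = input tokens/s + output tokens/s,
# so the difference is the rate at which prompt tokens were processed.
input_tok_s = TOTAL_TOK_S - OUTPUT_TOK_S

print(f"about {per_gpu_gb:.1f} GB of VRAM per GPU")
print(f"about {input_tok_s:.3f} input tokens per second")
```

Under the even-split assumption, each GPU holds roughly 168 GB before any request is served, which is why the model cannot fit on a single device.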
Key takeaway
If you are an MLOps engineer deploying a large language model that exceeds single-GPU memory, consider a multi-GPU serving stack such as vLLM on a cloud instance like Lambda's 8x Blackwell GPUs. This approach makes inference practical for models like Kimi-K2-Instruct, which demands over a terabyte of VRAM. Benchmark your deployment to validate time to first token and overall throughput before going to production.
Key insights
Deploying large MoE models like Kimi-K2-Instruct requires a multi-GPU inference engine such as vLLM, running on hardware with enough combined VRAM to hold the model, such as an 8x Blackwell GPU instance.
Principles
- Models that exceed a single GPU's memory need multi-GPU serving.
- Benchmark before production to confirm throughput and time to first token.
Method
Deploy Kimi-K2-Instruct by launching an 8x Blackwell GPU instance on Lambda, starting a vLLM server with `--tensor-parallel-size 8`, and benchmarking it with `vllm bench serve`. Enable sleep mode so that the server state can be reset, which keeps the measurements accurate.
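The method above can be sketched as two command lines, assembled here in Python so the flags are explicit. This is a minimal sketch, not the original article's exact invocation: the model identifier `moonshotai/Kimi-K2-Instruct`, the `--trust-remote-code` flag, and the `--enable-sleep-mode` flag are assumptions that the summary does not confirm, and the helper names are hypothetical. Only `--tensor-parallel-size 8` and `vllm bench serve` come from the text. The script prints the commands and does not run them.

```python
import shlex

# Assumption: Hugging Face repository name for the model.
MODEL_ID = "moonshotai/Kimi-K2-Instruct"


def build_serve_command(model: str = MODEL_ID, tp_size: int = 8) -> list[str]:
    """Return the argument list for starting the vLLM server."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),  # one shard per GPU (from the text)
        "--trust-remote-code",   # assumption: model ships custom code
        "--enable-sleep-mode",   # assumption: flag that turns on sleep mode
    ]


def build_bench_command(model: str = MODEL_ID) -> list[str]:
    """Return the argument list for benchmarking the running server."""
    return ["vllm", "bench", "serve", "--model", model]


serve_cmd = build_serve_command()
bench_cmd = build_bench_command()

# Print shell-ready command lines; nothing is executed here.
print(shlex.join(serve_cmd))
print(shlex.join(bench_cmd))
```

Keeping the commands as argument lists makes them easy to pass to `subprocess.run` later, and `tp_size` can be changed if the instance has a different GPU count.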
In practice
- Use `--tensor-parallel-size 8` to shard the model across all 8 GPUs.
- Use `vllm bench serve` to measure throughput and time to first token.
- Activate sleep mode to reset server state between benchmark runs.
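The sleep-mode reset mentioned above could look like the sketch below, which builds (but does not send) the two HTTP requests. The summary does not describe the mechanism, so everything here is an assumption: the server address `http://localhost:8000`, the `/sleep` and `/wake_up` endpoints, and the `level` query parameter are taken from vLLM's development-mode API as the author of this sketch understands it, and may differ between versions. The function names are hypothetical.

```python
from urllib.request import Request

# Assumption: the vLLM server listens on the default local address.
BASE_URL = "http://localhost:8000"


def sleep_request(level: int = 1) -> Request:
    """Build a POST request asking the server to enter sleep mode.

    Assumption: the endpoint is /sleep and accepts a `level` query parameter.
    """
    return Request(f"{BASE_URL}/sleep?level={level}", method="POST")


def wake_request() -> Request:
    """Build a POST request asking the server to wake up again.

    Assumption: the endpoint is /wake_up.
    """
    return Request(f"{BASE_URL}/wake_up", method="POST")


# Inspect the requests without contacting any server.
for req in (sleep_request(), wake_request()):
    print(req.get_method(), req.full_url)
```

To actually reset a running server, each request would be passed to `urllib.request.urlopen` between benchmark runs, provided the server was started with sleep mode enabled.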
Topics
- Kimi-K2-Instruct
- vLLM
- Multi-GPU Inference
- Mixture-of-Experts
- LLM Deployment
Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Lambda Deep Learning Blog.