How to serve Kimi-K2-Instruct on Lambda with vLLM

· Source: The Lambda Deep Learning Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, short

Summary

This article details the deployment of Moonshot AI's Kimi-K2-Instruct, a one-trillion-parameter Mixture-of-Experts (MoE) language model, on Lambda's 8x NVIDIA Blackwell GPU instances using vLLM for multi-GPU inference. Kimi-K2-Instruct is MIT-licensed and has a 128K context window. It requires 959 GB of disk space and 1,347 GB of VRAM at idle, which makes single-GPU operation impractical. The deployment has three steps: spin up an 8x NVIDIA Blackwell GPU instance, start a vLLM server with parameters such as `--tensor-parallel-size 8`, and benchmark the server with `vllm bench serve`. The benchmark reports an output generation throughput of 219.644 ± 0.497 tokens per second, a total throughput of 332.358 ± 0.810 tokens per second, and a mean time to first token of 148.986 ± 6.619 ms.
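To put the figures above in perspective, the following is a rough back-of-the-envelope sketch rather than anything taken from the original article. It rests on two assumptions: that the idle VRAM is spread evenly across the eight GPUs (tensor parallelism shards weights roughly evenly, but not exactly), and that total throughput is the sum of input and output token rates. The function names are illustrative.

```python
# Figures quoted in the summary above.
IDLE_VRAM_GB = 1347.0
NUM_GPUS = 8
OUTPUT_TOK_PER_S = 219.644
TOTAL_TOK_PER_S = 332.358


def per_gpu_share_gb(total_gb: float, num_gpus: int) -> float:
    """Even split of memory across GPUs (an assumption, not a measurement)."""
    return total_gb / num_gpus


def input_tok_per_s(total: float, output: float) -> float:
    """Input-token rate, assuming total throughput = input + output."""
    return total - output


per_gpu = per_gpu_share_gb(IDLE_VRAM_GB, NUM_GPUS)
prompt_rate = input_tok_per_s(TOTAL_TOK_PER_S, OUTPUT_TOK_PER_S)

print(f"Approximate VRAM per GPU: {per_gpu:.1f} GB")
print(f"Approximate input throughput: {prompt_rate:.3f} tokens/s")
```

Under these assumptions each GPU holds a little under 170 GB at idle, which illustrates why the model needs all eight GPUs rather than a subset.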

Key takeaway

MLOps engineers deploying large language models that exceed single-GPU memory should consider multi-GPU serving with vLLM on cloud instances such as Lambda's 8x NVIDIA Blackwell GPU instances. This approach makes inference practical for models like Kimi-K2-Instruct, which needs over a terabyte of VRAM. Benchmark the deployment to validate time to first token and overall throughput before moving to production.

Key insights

Deploying large MoE models like Kimi-K2-Instruct requires multi-GPU inference solutions such as vLLM on specialized hardware.

Method

Deploy Kimi-K2-Instruct by launching an 8x NVIDIA Blackwell GPU instance on Lambda, starting a vLLM server with `--tensor-parallel-size 8`, and then benchmarking it using `vllm bench serve` with sleep mode enabled for accurate measurements.
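The two commands in this method might be assembled along the following lines. This is a minimal sketch, not the original article's exact invocation: only `--tensor-parallel-size 8` and `vllm bench serve` come from the summary. The model ID `moonshotai/Kimi-K2-Instruct`, the port, the prompt count, `--trust-remote-code`, and the mapping of "sleep mode" to vLLM's `--enable-sleep-mode` server flag are all assumptions. The script only prints the commands and does not launch anything.

```python
import shlex

# Assumed Hugging Face repository ID for the model (not stated in the summary).
MODEL_ID = "moonshotai/Kimi-K2-Instruct"


def serve_command(model: str = MODEL_ID,
                  tensor_parallel_size: int = 8,
                  port: int = 8000) -> list[str]:
    """Build an argument list for starting the vLLM server."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),  # from the summary
        "--trust-remote-code",   # assumption: model ships custom code
        "--enable-sleep-mode",   # assumption: what "sleep mode" refers to
        "--port", str(port),     # assumption: default-style port
    ]


def bench_command(model: str = MODEL_ID,
                  port: int = 8000,
                  num_prompts: int = 100) -> list[str]:
    """Build an argument list for benchmarking the running server."""
    return [
        "vllm", "bench", "serve",
        "--model", model,
        "--port", str(port),
        "--num-prompts", str(num_prompts),  # assumption: illustrative value
        "--trust-remote-code",              # assumption: needed for the tokenizer
    ]


# Print shell-ready command lines for review before running them by hand.
print(shlex.join(serve_command()))
print(shlex.join(bench_command()))
```

Keeping the commands as argument lists makes it easy to pass them to `subprocess.run` later, or to vary the parallelism and prompt count between benchmark runs.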

Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Lambda Deep Learning Blog.