I Deployed Local LLMs in Production for a Year. Part 1: The Mental Model
Summary
This article, Part 1 of a two-part series, is a hands-on guide to running local LLM runtimes such as Ollama, llama.cpp, and vLLM in production, going beyond basic tutorials. It covers seven commonly overlooked aspects, including the two-phase prefill/decode execution of each request, the hidden FIFO queue behind "simple APIs" like Ollama's, and the significant VRAM consumed by the KV cache. The author explains that model loading is a three-phase process (disk to RAM, RAM to VRAM, then GPU warmup) and that the default configurations of tools like Ollama are tuned for single-developer use, not production serving. Specific examples include Ollama's default `num_ctx` of 4096 tokens and `OLLAMA_NUM_PARALLEL=1`, which cause latency problems that surface without any error. The article also contrasts Ollama's static batching with vLLM's continuous batching and PagedAttention, which deliver higher throughput.
Key takeaway
For MLOps Engineers deploying local LLMs, understanding the underlying mechanics of request processing, memory allocation, and model loading is crucial. You must explicitly configure parameters like `num_ctx`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_KEEP_ALIVE` to avoid silent performance bottlenecks and OOM errors. Your deployment strategy should account for the KV cache's VRAM impact and the distinct prefill/decode phases to optimize for your specific traffic patterns, potentially opting for runtimes like vLLM for higher concurrency.
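As a rough illustration of setting these parameters explicitly, the sketch below builds a request body for Ollama's `/api/generate` endpoint with a per-request `num_ctx` and `keep_alive`, and lists the two server-side environment variables. The helper name `build_generate_request`, the model name, and every value shown are placeholders chosen for the example, not recommendations from the article.

```python
import json

# Server-side settings, read by the Ollama server process at startup.
# The values are illustrative assumptions, not tuned recommendations.
SERVER_ENV = {
    "OLLAMA_NUM_PARALLEL": "4",   # concurrent requests per loaded model
    "OLLAMA_KEEP_ALIVE": "30m",   # how long a model stays loaded when idle
}


def build_generate_request(model, prompt, num_ctx, keep_alive):
    """Return a JSON-serializable body for POST /api/generate.

    Hypothetical helper: it only assembles the payload and sends nothing.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,                   # stream tokens so time to first token is observable
        "keep_alive": keep_alive,         # per-request override of the idle unload timer
        "options": {"num_ctx": num_ctx},  # context window used for this request
    }


payload = build_generate_request("llama3", "Hello", num_ctx=8192, keep_alive="30m")
print(json.dumps(payload, indent=2))
```

Pinning `num_ctx` per request keeps the KV cache allocation predictable instead of depending on whatever default the server was started with.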
Key insights
Production LLM deployment requires understanding hidden queues, KV cache memory, and multi-phase model loading beyond basic tutorials.
Principles
- LLM requests have distinct compute-bound prefill and memory-bound decode phases.
- KV cache size, not just model weights, dictates VRAM usage and scaling limits.
- Default LLM runtime configurations are often unsuitable for production workloads.
Method
Calculate the KV cache size per sequence, in bytes, using the formula `2 × context_length × num_layers × num_kv_heads × head_dim × bytes_per_element`, where the leading 2 accounts for storing both keys and values, to predict VRAM consumption accurately.
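A minimal sketch of that formula in Python follows. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit cache elements) is an assumption typical of an 8B-class model with grouped-query attention, not a figure taken from the article.

```python
def kv_cache_bytes(context_length, num_layers, num_kv_heads, head_dim,
                   bytes_per_element=2):
    """KV cache size in bytes for a single sequence.

    The factor 2 covers the separate key and value tensors.
    """
    return (2 * context_length * num_layers * num_kv_heads
            * head_dim * bytes_per_element)


# Assumed model shape (illustrative): 32 layers, 8 KV heads, head_dim 128, fp16.
per_sequence = kv_cache_bytes(4096, 32, 8, 128, bytes_per_element=2)
print(per_sequence / 2**20)   # 512.0 (MiB at a 4096-token context)

long_context = kv_cache_bytes(32768, 32, 8, 128, bytes_per_element=2)
print(long_context / 2**30)   # 4.0 (GiB at a 32768-token context)
```

The result scales linearly with context length and with the number of sequences held in memory at once, so raising either one multiplies the VRAM needed on top of the model weights.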
In practice
- Measure both "time to first token" and "tokens per second" for LLM performance.
- Set container memory limits to at least 1.5× the model file size to prevent silent paging.
- Disable swap on LLM serving nodes to avoid latency spikes.
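The first measurement above can be sketched as a small helper that consumes any streamed token iterator and reports both numbers separately. The function name `measure_stream` and the injectable `clock` parameter are assumptions made for this example. In real use, the iterator would wrap the streaming response of whichever runtime is deployed.

```python
import time


def measure_stream(tokens, clock=time.perf_counter):
    """Return (time_to_first_token_s, decode_tokens_per_s, token_count).

    Time to first token mostly reflects queueing plus prefill.
    Tokens per second is computed over the decode phase only,
    i.e. from the first token to the last one.
    """
    start = clock()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now
        last_at = now
        count += 1
    if first_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_at - start
    decode_time = last_at - first_at
    # Tokens after the first one divided by the time they took to arrive.
    tps = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return ttft, tps, count


# Deterministic demo with a fake clock: request sent at 0.0 s,
# tokens arrive at 0.5 s, 0.75 s, and 1.0 s.
ticks = iter([0.0, 0.5, 0.75, 1.0])
ttft, tps, count = measure_stream(["a", "b", "c"], clock=lambda: next(ticks))
print(ttft, tps, count)   # 0.5 4.0 3
```

Reporting the two numbers separately matters because a hidden queue or a long prompt inflates time to first token without changing decode speed, so a single end-to-end latency figure would hide which phase is the bottleneck.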
Topics
- Local LLM Deployment
- KV Cache
- Prefill and Decode
- Ollama
- vLLM
Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.