I Deployed Local LLMs in Production for a Year. Part 1: The Mental Model

2026-05-04 · Source: AI Advances - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

This article, Part 1 of a two-part series, is a hands-on guide to serving local LLMs in production with runtimes such as Ollama, llama.cpp, and vLLM, going beyond basic tutorials. It covers seven aspects that are often overlooked, including the two-phase prefill/decode execution of every request, the hidden FIFO queue behind "simple APIs" like Ollama's, and the significant VRAM consumed by the KV cache. The author explains that model loading is a three-phase process (disk to RAM, RAM to VRAM, then GPU warmup) and that the default configurations of tools like Ollama are tuned for single-developer use, not production serving. Specific examples include Ollama's default `num_ctx` of 4096 tokens and `OLLAMA_NUM_PARALLEL=1`, which queues concurrent requests and silently inflates latency. The article also contrasts Ollama's static batching with vLLM's continuous batching and PagedAttention, which deliver higher throughput.

Key takeaway

For MLOps engineers deploying local LLMs, understanding the underlying mechanics of request processing, memory allocation, and model loading is crucial. You must explicitly configure parameters like `num_ctx`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_KEEP_ALIVE` to avoid silent performance bottlenecks and out-of-memory (OOM) errors. Your deployment strategy should account for the KV cache's VRAM impact and the distinct prefill and decode phases so that you can optimize for your specific traffic patterns, potentially opting for a runtime like vLLM when you need higher concurrency.
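
As a rough illustration of setting these parameters explicitly, the sketch below builds the environment for the Ollama server process and a request body that pins `num_ctx` per call. The values (`"4"`, `"30m"`, `8192`), the model name `my-model`, and the helper `build_generate_request` are hypothetical placeholders, not recommendations from the article.

```python
import json

# Server-side settings, passed as environment variables to the Ollama process.
# The values are illustrative assumptions; tune them to your hardware and traffic.
server_env = {
    "OLLAMA_NUM_PARALLEL": "4",   # concurrent requests served per loaded model
    "OLLAMA_KEEP_ALIVE": "30m",   # how long a model stays loaded after its last request
}


def build_generate_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a JSON body for Ollama's /api/generate with an explicit context window."""
    return {
        "model": model,
        "prompt": prompt,
        "options": {"num_ctx": num_ctx},  # set explicitly instead of relying on the default
    }


# "my-model" is a placeholder model name.
body = build_generate_request("my-model", "Hello", 8192)
print(json.dumps(body, indent=2))
```

Pinning `num_ctx` in the request keeps the context window from silently falling back to the runtime default when a client forgets to set it.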

Key insights

Production LLM deployment requires understanding hidden queues, KV cache memory, and multi-phase model loading beyond basic tutorials.

Principles

Each LLM request has two distinct phases: a compute-bound prefill and a memory-bandwidth-bound decode.
KV cache size, not just model weights, dictates VRAM usage and scaling limits.
Default LLM runtime configurations are often unsuitable for production workloads.

Method

Calculate the KV cache size in bytes for one sequence using the formula `2 × context_length × num_layers × num_kv_heads × head_dim × bytes_per_element`, where the leading 2 accounts for storing both keys and values, to predict VRAM consumption accurately.
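
The formula can be turned into a small helper for capacity planning. This is a minimal sketch: the model shape used below (32 layers, 8 KV heads, head dimension 128, 2-byte FP16 elements) is an illustrative assumption, and the `num_sequences` multiplier for concurrent sequences is an addition that is not part of the formula above.

```python
def kv_cache_bytes(
    context_length: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,  # 2 bytes for FP16/BF16
    num_sequences: int = 1,      # assumption: one full-size cache per concurrent sequence
) -> int:
    """KV cache size in bytes; the leading 2 counts both keys and values."""
    per_sequence = (
        2 * context_length * num_layers * num_kv_heads * head_dim * bytes_per_element
    )
    return per_sequence * num_sequences


# Illustrative shape only; read the real values from your model's config.
single = kv_cache_bytes(4096, 32, 8, 128)
four_parallel = kv_cache_bytes(4096, 32, 8, 128, num_sequences=4)

print(single / 2**20)         # 512.0 (MiB)
print(four_parallel / 2**30)  # 2.0 (GiB)
```

Under these assumptions, a 4096-token context costs 512 MiB per sequence on top of the weights, and four parallel sequences cost 2 GiB.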

In practice

Measure both "time to first token" (dominated by prefill) and "tokens per second" (dominated by decode) when evaluating LLM performance.
Set container memory limits to at least 1.5× model file size to prevent silent paging.
Disable swap on LLM serving nodes to avoid latency spikes.
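
The two metrics in the first item can be collected from any streaming response. The sketch below assumes each streamed chunk corresponds to one token, which is a simplification, and that `chunks` is a lazy iterator so the wait for the first chunk includes prefill. The helper `measure_stream` and the fake clock are hypothetical, used only to keep the example deterministic.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (time_to_first_token_seconds, decode_tokens_per_second).

    Assumes one chunk is one token. Decode throughput is computed over the
    tokens that arrive after the first one, so prefill time is excluded.
    """
    start = clock()
    first = None
    last = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return None, 0.0  # the stream produced nothing
    decode_time = last - first
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return first - start, tps


# Deterministic demo with a fake clock: start at 0.0 s, tokens at 0.5, 0.75, 1.0 s.
ticks = iter([0.0, 0.5, 0.75, 1.0])
ttft, tps = measure_stream(["a", "b", "c"], clock=lambda: next(ticks))
print(ttft, tps)  # 0.5 4.0
```

In real use, pass the runtime's streaming iterator and leave `clock` at its default. Reporting the two numbers separately keeps a slow prefill from hiding behind a healthy decode rate, and the reverse.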

Topics

Local LLM Deployment
KV Cache
Prefill and Decode
Ollama
vLLM

Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.