I Deployed Local LLMs in Production for a Year. Part 1: The Mental Model

· Source: AI Advances - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, extended

Summary

This article, Part 1 of a two-part series, is a hands-on guide to running local LLMs in production with runtimes such as Ollama, llama.cpp, and vLLM, moving beyond basic tutorials. It covers seven frequently overlooked aspects, including the two-phase execution of each request (prefill, then decode), the hidden FIFO queue behind "simple" APIs like Ollama's, and the significant VRAM consumed by the KV cache. The author explains that model loading is a three-phase process (disk to RAM, RAM to VRAM, then GPU warmup) and that the defaults of tools like Ollama are tuned for single-developer use, not production serving. Specific examples include Ollama's default `num_ctx` of 4096 tokens and `OLLAMA_NUM_PARALLEL=1`, which makes concurrent requests wait in line and adds latency without surfacing any error. The article also contrasts Ollama's static batching with vLLM's continuous batching and PagedAttention, which deliver higher throughput.
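
To make the queueing effect concrete, here is a minimal sketch of how latency stacks up when a single slot serves requests in FIFO order. It is not taken from the original article: the function names and the throughput figures (2000 tokens/s prefill, 50 tokens/s decode) are hypothetical, and the model ignores batching, network time, and model loading.

```python
def service_time(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough time for one request: prefill the prompt, then decode token by token."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def fifo_latencies(arrival_times, service_times):
    """End-to-end latency per request when one slot serves requests in arrival order."""
    latencies = []
    slot_free_at = 0.0
    for arrival, service in zip(arrival_times, service_times):
        start = max(arrival, slot_free_at)  # wait until the single slot is free
        finish = start + service
        latencies.append(finish - arrival)
        slot_free_at = finish
    return latencies


# Hypothetical numbers: 1000-token prompt, 200-token answer.
per_request = service_time(1000, 200, prefill_tps=2000, decode_tps=50)

# Three users hit the server at the same moment.
print(fifo_latencies([0.0, 0.0, 0.0], [per_request] * 3))
```

Under these assumptions each request needs 4.5 seconds of work on its own, but the third caller waits 13.5 seconds, and nothing in the API response indicates that the extra time was spent in a queue.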

Key takeaway

For MLOps engineers deploying local LLMs, understanding the underlying mechanics of request processing, memory allocation, and model loading is crucial. Set `num_ctx`, `OLLAMA_NUM_PARALLEL`, and `OLLAMA_KEEP_ALIVE` explicitly rather than relying on defaults, so that you avoid silent performance bottlenecks and OOM errors. Your deployment strategy should account for the KV cache's VRAM footprint and the distinct prefill and decode phases, tuned to your specific traffic patterns; for higher concurrency, consider a runtime such as vLLM.
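
As one way to make these settings explicit, the sketch below builds a request for Ollama's `/api/generate` endpoint with `num_ctx` passed under `options` and `keep_alive` set per request. The helper names, the model tag, and the chosen values (8192 tokens, 30 minutes) are illustrative assumptions, not recommendations from the article. `OLLAMA_NUM_PARALLEL` and `OLLAMA_KEEP_ALIVE` are server-side environment variables, so they must be set in the environment of the Ollama server process and cannot be changed from a client request.

```python
import json
import urllib.request

# Ollama's default local address; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt, num_ctx=8192, keep_alive="30m"):
    """Build a non-streaming request body with explicit context size and keep-alive."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,         # how long the model stays loaded after this call
        "options": {"num_ctx": num_ctx},  # context window to allocate for this request
    }


def send(payload, url=OLLAMA_URL, timeout=120):
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# "llama3.1:8b" is a placeholder model tag; use whatever model you have pulled.
payload = build_generate_request("llama3.1:8b", "Summarize the KV cache in one sentence.")
print(json.dumps(payload, indent=2))
# reply = send(payload)  # requires a running Ollama server
```

Pinning these values in code or deployment config keeps the behavior reproducible when a default changes between releases.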

Key insights

Production LLM deployment requires understanding hidden queues, KV cache memory, and multi-phase model loading beyond basic tutorials.

Method

Calculate the KV cache size per sequence with the formula `2 × context_length × num_layers × num_kv_heads × head_dim × bytes_per_element`, where the leading 2 accounts for storing both keys and values, to predict VRAM consumption accurately.
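
A worked instance of the formula, as a minimal sketch. The model shape used here (32 layers, 8 KV heads, head dimension 128, 2-byte fp16 elements) is an illustrative assumption for an 8B-class model with grouped-query attention and does not come from the article; substitute the values from your own model's config.

```python
def kv_cache_bytes(context_length, num_layers, num_kv_heads, head_dim, bytes_per_element):
    """KV cache size in bytes for one sequence; the factor 2 covers keys and values."""
    return 2 * context_length * num_layers * num_kv_heads * head_dim * bytes_per_element


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Assumed model shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache (2 bytes).
for ctx in (4096, 8192, 32768):
    size = kv_cache_bytes(ctx, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_element=2)
    print(f"context {ctx:>6}: {to_gib(size):.2f} GiB per sequence")
```

The result scales linearly with context length, and it applies to a single sequence, so serving several requests in parallel multiplies the VRAM needed on top of the model weights.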

Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.