I Deployed Local LLMs in Production for a Year. Part 2: The Operational Playbook
Summary
This operational guide details best practices for serving local Large Language Models (LLMs) in production with runtimes such as Ollama, llama.cpp, and vLLM, focusing on configuration, tradeoffs, and observability. It emphasizes that default settings are often suboptimal for production, particularly for context length, KV cache management, and model keep-alive behavior. Using practical examples such as serving Llama 3.1 8B on a 24GB GPU, the article illustrates how KV cache quantization (e.g., q8_0 or fp8) and prefix caching can significantly improve throughput and reduce VRAM usage without substantial quality degradation. It also outlines critical environment variables and flags for each runtime, discusses the tradeoffs between memory, context length, throughput, latency, and simplicity, and highlights common deployment mistakes such as misinterpreting GPU utilization or relying on synthetic benchmarks.
Key takeaway
For MLOps Engineers deploying local LLMs, prioritize explicit configuration over defaults. Measure your actual prompt and output length distributions to correctly set `num_ctx` and parallel slots, and always enable KV cache quantization and Flash Attention on supported hardware. Implement robust observability for prefill/decode times and VRAM usage, and load-test with real traffic patterns to avoid unexpected production failures and ensure consistent performance for your users.
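As one way to start on the prefill/decode observability mentioned above, the sketch below splits a single non-streaming Ollama `/api/generate` response into prefill and decode rates. It relies on the `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` fields of that response (durations are in nanoseconds). The function name and the sample numbers are illustrative assumptions, not taken from the article.

```python
def ollama_timing_summary(resp: dict) -> dict:
    """Derive prefill and decode rates from one Ollama /api/generate response."""
    ns_per_s = 1e9  # Ollama reports durations in nanoseconds

    # .get(..., 0) is defensive: treat a missing field as zero instead of failing.
    prefill_s = resp.get("prompt_eval_duration", 0) / ns_per_s
    decode_s = resp.get("eval_duration", 0) / ns_per_s
    prompt_tokens = resp.get("prompt_eval_count", 0)
    output_tokens = resp.get("eval_count", 0)

    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "prefill_tok_per_s": prompt_tokens / prefill_s if prefill_s else 0.0,
        "decode_tok_per_s": output_tokens / decode_s if decode_s else 0.0,
    }


# Made-up sample values, only to show the shape of the input.
sample = {
    "prompt_eval_count": 1200,
    "prompt_eval_duration": 600_000_000,   # 0.6 s of prefill
    "eval_count": 256,
    "eval_duration": 6_400_000_000,        # 6.4 s of decode
}
summary = ollama_timing_summary(sample)
print(summary)
```

Tracking the two rates separately makes it easier to tell whether a latency regression comes from long prompts (prefill) or from slow generation (decode).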
Key insights
Optimizing local LLM deployments requires careful configuration of context, KV cache, and parallelism to balance memory, throughput, and latency.
Principles
- At long context lengths or with many parallel slots, the KV cache can consume more VRAM than the model weights.
- Defaults are for demos, not production deployments.
- Measure with real traffic, not synthetic benchmarks.
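The first principle can be checked with back-of-the-envelope arithmetic. The sketch below estimates KV cache size from the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). Treating q8_0 as exactly 1 byte per element is a simplifying assumption that ignores its small per-block scale overhead, and the function name is illustrative.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2.0):
    """Approximate KV cache size: keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Llama 3.1 8B uses grouped-query attention with 8 KV heads.
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token_kib = kv_cache_bytes(**LLAMA31_8B, n_tokens=1) / 1024  # 128 KiB at fp16

for ctx in (8192, 32768, 131072):
    fp16_gib = kv_cache_bytes(**LLAMA31_8B, n_tokens=ctx) / GIB
    q8_gib = kv_cache_bytes(**LLAMA31_8B, n_tokens=ctx, bytes_per_elem=1.0) / GIB
    print(f"ctx={ctx:>6}: fp16 ~ {fp16_gib:.1f} GiB, 8-bit ~ {q8_gib:.1f} GiB")
```

Under these assumptions a single 131072-token sequence needs about 16 GiB of fp16 KV cache, roughly the size of the fp16 weights of an 8B-parameter model and well above a 4-bit quantized copy. For several parallel slots, pass the total token count across slots as `n_tokens`.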
Method
Configure LLM runtimes by setting context length to the 95th percentile of actual request lengths (prompt plus output tokens), enabling KV cache quantization, and using prefix caching for requests that share a prefix, to reduce VRAM usage and latency.
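A minimal sketch of the sizing step, assuming you have logged prompt and output token counts per request. The nearest-rank percentile, the rounding to a multiple of 1024, and the function names are illustrative choices rather than anything prescribed by the article.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def suggest_num_ctx(prompt_tokens, output_tokens, pct=95, round_to=1024):
    """Suggest a context length covering pct% of requests (prompt + output)."""
    totals = [p + o for p, o in zip(prompt_tokens, output_tokens)]
    target = percentile_nearest_rank(totals, pct)
    # Round up so the configured context is a tidy multiple of round_to.
    return math.ceil(target / round_to) * round_to

# Hypothetical logged lengths: request totals are 100, 200, ..., 2000 tokens.
prompts = [100 * i - 50 for i in range(1, 21)]
outputs = [50] * 20
num_ctx = suggest_num_ctx(prompts, outputs)
print(num_ctx)
```

Requests longer than the chosen value still need a policy (truncate, reject, or route to a larger-context deployment), so it is worth tracking how often that happens.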
In practice
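- In Ollama, set `OLLAMA_KV_CACHE_TYPE=q8_0` to reduce KV cache VRAM (this requires Flash Attention, `OLLAMA_FLASH_ATTENTION=1`).
- In vLLM, pass `--enable-prefix-caching` for RAG workloads that share a common prompt prefix.
- On dedicated serving nodes, set `OLLAMA_KEEP_ALIVE=-1` so models stay loaded in memory.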
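The settings above can be collected in one place, as in the sketch below. It only builds an environment for `ollama serve` and an argument list for `vllm serve`; it launches nothing. The helper names, the parallel slot count of 4, and the `fp8` KV cache choice for vLLM are illustrative assumptions to adapt to your hardware and traffic.

```python
import os

def ollama_env(base=None, parallel=4):
    """Environment for `ollama serve` with explicit production settings."""
    env = dict(os.environ if base is None else base)
    env.update({
        "OLLAMA_FLASH_ATTENTION": "1",        # needed for a quantized KV cache
        "OLLAMA_KV_CACHE_TYPE": "q8_0",       # 8-bit KV cache
        "OLLAMA_KEEP_ALIVE": "-1",            # keep models loaded indefinitely
        "OLLAMA_NUM_PARALLEL": str(parallel), # assumed slot count, tune to traffic
    })
    return env

def vllm_argv(model, max_model_len):
    """Argument list for `vllm serve` with prefix caching and an fp8 KV cache."""
    return [
        "vllm", "serve", model,
        "--enable-prefix-caching",
        "--kv-cache-dtype", "fp8",
        "--max-model-len", str(max_model_len),
    ]

env = ollama_env(base={})
argv = vllm_argv("meta-llama/Llama-3.1-8B-Instruct", 8192)
print(env)
print(" ".join(argv))
# To launch: subprocess.Popen(["ollama", "serve"], env=ollama_env())
```

Keeping these values in code or a versioned config file makes the deployment reproducible and avoids silently falling back to runtime defaults.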
Topics
- Local LLM Deployment
- KV Cache Optimization
- LLM Quantization
- Prefix Caching
- LLM Observability
Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.