I Deployed Local LLMs in Production for a Year. Part 2: The Operational Playbook

· Source: AI Advances - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

This operational guide details best practices for running local Large Language Models (LLMs) in production with runtimes such as Ollama, llama.cpp, and vLLM, focusing on configuration, tradeoffs, and observability. It emphasizes that default settings are often suboptimal for production, particularly for context length, KV cache management, and model keep-alive behavior. The article uses practical examples, such as serving Llama 3.1 8B on a 24GB GPU, to illustrate how KV cache quantization (e.g., q8_0 or fp8) and prefix caching can significantly improve throughput and reduce VRAM usage without substantial quality degradation. It also lists the critical environment variables and flags for each runtime, discusses the tradeoffs between memory, context length, throughput, latency, and simplicity, and highlights common deployment mistakes such as misinterpreting GPU utilization or relying on synthetic benchmarks.
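
To make the VRAM argument concrete, here is a rough back-of-the-envelope sketch of KV cache size for Llama 3.1 8B. It assumes the model's published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and treats an 8-bit cache as exactly 1 byte per element, which ignores the small per-block scale overhead of formats like q8_0. The function names are illustrative and do not come from the original article.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: float) -> float:
    """Bytes of KV cache one token occupies: K and V, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads (GQA), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_gib(context_tokens: int, bytes_per_elem: float) -> float:
    """Total KV cache in GiB for one sequence of the given length."""
    per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem)
    return context_tokens * per_token / 2**30

fp16 = kv_cache_gib(32_768, 2.0)  # 16-bit cache
int8 = kv_cache_gib(32_768, 1.0)  # 8-bit cache, scale overhead ignored
print(f"32k context: fp16 = {fp16:.1f} GiB, 8-bit = {int8:.1f} GiB")
# prints "32k context: fp16 = 4.0 GiB, 8-bit = 2.0 GiB"
```

Under these assumptions an 8-bit cache halves the per-sequence footprint, which is why it frees room for longer contexts or more parallel slots on a 24GB card.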

Key takeaway

MLOps engineers deploying local LLMs should prioritize explicit configuration over defaults. Measure your actual prompt and output length distributions to set `num_ctx` (Ollama's context-length option) and the number of parallel slots correctly, and enable KV cache quantization and Flash Attention on supported hardware. Track prefill and decode times and VRAM usage, and load-test with real traffic patterns to avoid unexpected production failures and keep performance consistent for your users.
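
One way to turn measured lengths into settings is sketched below. The sample token counts, the 8 GiB KV budget, and the helper names are hypothetical, and the slot estimate assumes each parallel slot reserves a full `num_ctx` window with a 16-bit cache at 131072 bytes per token (the Llama 3.1 8B figure). Runtimes differ in how they split context across slots, so treat this as a starting point rather than a rule.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def suggest_num_ctx(prompt_tokens, output_tokens, pct=95, multiple=1024):
    """Round the pct-th percentile of total request length up to a multiple."""
    totals = [p + o for p, o in zip(prompt_tokens, output_tokens)]
    return math.ceil(percentile(totals, pct) / multiple) * multiple

def max_parallel_slots(kv_budget_bytes, num_ctx, kv_bytes_per_token):
    """How many full-context slots fit in the VRAM left over for KV cache."""
    return int(kv_budget_bytes // (num_ctx * kv_bytes_per_token))

# Hypothetical token counts pulled from request logs.
prompts = [900, 1200, 1500, 2100, 800, 3000, 1100, 950, 1700, 6400]
outputs = [200] * len(prompts)

num_ctx = suggest_num_ctx(prompts, outputs)
slots = max_parallel_slots(8 * 2**30, num_ctx, 131072)
print(num_ctx, slots)  # prints "7168 9"
```

With only ten samples the 95th percentile is simply the largest request; real logs with thousands of requests give a more meaningful cutoff.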

Key insights

Optimizing local LLM deployments requires careful configuration of context, KV cache, and parallelism to balance memory, throughput, and latency.

Principles

Method

Configure LLM runtimes by setting context length to the 95th percentile of actual requests, enabling KV cache quantization, and utilizing prefix caching for shared prefixes to optimize VRAM and latency.
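
As a minimal illustration of explicit configuration, the sketch below builds a request body for Ollama's `/api/generate` endpoint that pins the context length and keep-alive per request instead of relying on defaults. The chosen values (8192 tokens, 30 minutes) and the helper name are assumptions for illustration, not recommendations from the article, and the body is only constructed here, not sent.

```python
import json

def build_generate_payload(model: str, prompt: str,
                           num_ctx: int, keep_alive: str) -> dict:
    """Request body with context length and keep-alive set explicitly."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,         # how long the model stays loaded
        "options": {"num_ctx": num_ctx},  # context window for this request
    }

payload = build_generate_payload(
    model="llama3.1:8b",
    prompt="Summarize the incident report in three bullet points.",
    num_ctx=8192,      # assumed value; derive it from measured request lengths
    keep_alive="30m",  # assumed value; tune to your traffic gaps
)
print(json.dumps(payload, indent=2))
```

Server-wide behavior such as KV cache quantization is configured on the runtime itself, not per request, so this covers only the request-level half of the setup.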

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.