3 Agents. 3 LLMs. 1 Aging GPU: Engineering Parallel Inference on Bare Metal

2026-06-25 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

The "lmxd" C++ daemon addresses the challenge of running multiple distinct Large Language Models (LLMs) concurrently on a single, older GPU with limited VRAM, such as an NVIDIA GTX 1080 with 8 GB. Running each model in its own `llama.cpp` process often fails because every process pre-reserves its KV cache aggressively, leading to out-of-memory errors. "lmxd" acts as a centralized bookkeeper: it owns the GPU and admits agents against a 90% VRAM cap. Agents register and decode over a Unix-socket protocol, while the daemon calls `llama_backend_init` only once per GPU and refcounts shared GGUF models. Crucially, it evicts KV caches to host RAM via `lmx::KvSwapHelper` when switching agents, so suspended agents consume zero VRAM. This lets three models (SmolLM2-360M, Qwen2-0.5B, Llama-3.2-1B) run on the GTX 1080 using 1.58 GB against a 7.73 GB ceiling, where naive parallel execution fails.
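The 7.73 GB ceiling is consistent with a 90% cap on an 8 GiB card when the result is expressed in decimal gigabytes. The short calculation below checks that arithmetic; it assumes the card exposes exactly 8 GiB, which the summary does not state, so treat it as a plausibility check rather than the daemon's actual computation.

```python
GIB = 1024 ** 3

total_bytes = 8 * GIB            # assumption: the card exposes exactly 8 GiB
cap_bytes = total_bytes * 0.90   # the 90% admission cap from the summary
used_gb = 1.58                   # reported footprint of the three models

cap_gb = cap_bytes / 1e9         # decimal gigabytes
headroom_gb = cap_gb - used_gb

print(f"cap = {cap_gb:.2f} GB, headroom = {headroom_gb:.2f} GB")
# prints: cap = 7.73 GB, headroom = 6.15 GB
```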

Key takeaway

MLOps engineers deploying multiple LLM agents on resource-constrained GPUs can avoid out-of-memory errors by adopting a centralized resource management daemon. This approach, exemplified by "lmxd", runs several small models concurrently on a single 8 GB GPU by controlling when models are loaded and where each KV cache lives. Implement VRAM bookkeeping and KV-cache swapping to host RAM to maximize GPU utilization and keep multi-agent operation stable.

Key insights

A C++ daemon enables parallel LLM inference on limited GPU VRAM by centralizing resource management and KV-cache swapping.

Principles

Centralized VRAM bookkeeping prevents overcommitment.
Pre-allocating KV cache causes OOM on shared GPUs.
Overlap compute with memory transfer for efficiency.
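The first principle can be illustrated with a small admission ledger. This is a minimal sketch in Python rather than the daemon's C++; the class name `VramLedger`, its methods, and the request sizes are all hypothetical, and only the 90% cap comes from the summary.

```python
class VramLedger:
    """Tracks per-agent VRAM reservations against a fixed cap (hypothetical API)."""

    def __init__(self, total_bytes, cap_fraction=0.90):
        self.cap_bytes = int(total_bytes * cap_fraction)
        self.reserved = {}  # agent id -> reserved bytes

    def used_bytes(self):
        return sum(self.reserved.values())

    def admit(self, agent, request_bytes):
        """Reserve memory for an agent, or refuse if the cap would be exceeded."""
        if agent in self.reserved:
            return False  # already registered
        if self.used_bytes() + request_bytes > self.cap_bytes:
            return False  # admitting this agent would overcommit the GPU
        self.reserved[agent] = request_bytes
        return True

    def release(self, agent):
        self.reserved.pop(agent, None)


MIB = 1024 ** 2
ledger = VramLedger(total_bytes=8192 * MIB)  # assumed 8 GiB card
results = [
    ledger.admit("agent-a", 4000 * MIB),
    ledger.admit("agent-b", 3000 * MIB),
    ledger.admit("agent-c", 1000 * MIB),  # 8000 MiB total would exceed the cap
]
print(results)  # prints: [True, True, False]
```

Because every reservation passes through one ledger, the third request is refused up front instead of failing later with an out-of-memory error inside an independent process.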

Method

The "lmxd" daemon uses a Unix-socket protocol for agents to `REGISTER` and `DECODE`. It enforces a 90% VRAM cap, loads models once, refcounts them, and swaps KV caches to host RAM for inactive agents.
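The load-once, refcount behavior behind `REGISTER` and `DECODE` might look roughly like the sketch below. The summary does not give the wire format, so the space-separated line layout, the `OK`/`ERR` replies, the class names, and the model file name are all assumptions; the socket transport and the actual inference call are omitted.

```python
class ModelRegistry:
    """Loads each model path once and refcounts it across agents."""

    def __init__(self, loader):
        self._loader = loader   # callable: path -> model handle
        self._models = {}       # path -> [handle, refcount]

    def acquire(self, path):
        entry = self._models.get(path)
        if entry is None:
            entry = [self._loader(path), 0]  # first user triggers the load
            self._models[path] = entry
        entry[1] += 1
        return entry[0]

    def release(self, path):
        entry = self._models[path]
        entry[1] -= 1
        if entry[1] == 0:
            del self._models[path]  # last user gone: drop the weights

    def refcount(self, path):
        entry = self._models.get(path)
        return entry[1] if entry else 0


class Daemon:
    """Handles one request line at a time (hypothetical wire format)."""

    def __init__(self, loader):
        self.registry = ModelRegistry(loader)
        self.agents = {}  # agent id -> model path

    def handle(self, line):
        parts = line.strip().split(" ", 2)
        cmd = parts[0].upper()
        if cmd == "REGISTER" and len(parts) == 3:
            agent, path = parts[1], parts[2]
            if agent in self.agents:
                return "ERR already-registered"
            self.registry.acquire(path)
            self.agents[agent] = path
            return "OK"
        if cmd == "DECODE" and len(parts) == 3:
            agent, prompt = parts[1], parts[2]
            if agent not in self.agents:
                return "ERR unknown-agent"
            # A real daemon would run the model here; this stub only echoes.
            return f"OK {self.agents[agent]} {len(prompt)}"
        return "ERR bad-request"


loads = []

def fake_loader(path):
    loads.append(path)  # record each real load so sharing can be checked
    return object()

daemon = Daemon(fake_loader)
daemon.handle("REGISTER a1 qwen2-0.5b.gguf")
daemon.handle("REGISTER a2 qwen2-0.5b.gguf")
reply = daemon.handle("DECODE a1 hello world")
```

Two agents registering the same GGUF path trigger a single load, which is the property that keeps shared weights from being duplicated in VRAM.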

In practice

Implement a VRAM ledger for multi-agent LLM deployments.
Use `cudaHostAlloc` and CUDA streams for layer streaming.
Serialize/deserialize KV cache to host memory for context switching.
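The context-switching step in the last item can be sketched as follows. This is a CPU-only simulation, not the `lmx::KvSwapHelper` implementation: a Python `array` stands in for the GPU-resident KV buffer, a dictionary of byte strings stands in for host RAM, and the class and method names are hypothetical.

```python
from array import array


class KvSwapper:
    """Keeps at most one agent's KV cache 'on device'; the rest live in host RAM."""

    def __init__(self):
        self.active = None  # id of the agent whose cache is resident
        self.device = None  # stand-in for the GPU-resident KV buffer
        self.host = {}      # agent id -> serialized KV bytes

    def switch_to(self, agent):
        if agent == self.active:
            return
        if self.active is not None:
            # Evict: serialize the resident cache to host memory.
            self.host[self.active] = self.device.tobytes()
        # Restore the incoming agent's cache, or start empty for a new agent.
        restored = array("f")
        restored.frombytes(self.host.pop(agent, b""))
        self.device = restored
        self.active = agent

    def append(self, values):
        """Simulate a decode step growing the active agent's cache."""
        self.device.extend(values)

    def device_bytes(self, agent):
        """Device memory held by an agent; zero while it is suspended."""
        if agent != self.active:
            return 0
        return len(self.device) * self.device.itemsize


swapper = KvSwapper()
swapper.switch_to("a")
swapper.append([1.0, 2.0, 3.0])
swapper.switch_to("b")              # agent "a" is evicted to host memory
swapper.append([9.0])
suspended = swapper.device_bytes("a")
swapper.switch_to("a")              # agent "a" resumes with its cache intact
resumed = list(swapper.device)
```

The suspended agent holds no device memory, and its cache comes back unchanged on resume, which is the behavior the summary attributes to KV-cache eviction.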

Topics

LLM Inference
GPU Resource Management
VRAM Optimization
Multi-Agent Systems
KV Cache Swapping
C++ Daemon

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.