How I am Running Multiple LLMs on a Single H100 — Challenges and Resolution
Summary
This article details a three-phase architectural evolution for running multiple Large Language Models (LLMs) on a single NVIDIA H100 80GB GPU for an intelligent document processing pipeline. Initially, concurrent model loading led to non-deterministic VRAM allocation issues, causing models like Qwen 2.5 14B and PaddleOCR-VL 1.5 to crash. The first resolution involved implementing a Strict Sequential Booting strategy using Docker health checks and static VRAM allocation (e.g., 50% for Qwen, 30% for Paddle). While stable, this approach was rigid and inefficient for "spiky" workloads. The architecture then evolved to dynamic virtualization using the kvcached library, which provides a liquid memory pool, allowing models to dynamically borrow VRAM as needed. Finally, a production-grade gateway was established using LiteLLM as a unified OpenAI-compatible proxy, configured with Gunicorn and `MAX_REQUESTS_BEFORE_RESTART` for long-term stability, achieving a resting state of ~58GB for multiple models.
Key takeaway
For AI Engineers deploying multiple LLMs on a single GPU, prioritize dynamic VRAM allocation over static partitioning to maximize hardware utilization and handle variable workloads efficiently. You should implement sequential model booting with health checks to ensure stability, then integrate a virtual memory solution like `kvcached` to enable flexible resource sharing. Additionally, consider using a unified proxy like LiteLLM with Gunicorn for robust, scalable, and maintainable production deployments.
Key insights
Dynamic VRAM allocation and sequential booting are crucial for efficient multi-LLM deployment on single GPUs.
Principles
- Avoid concurrent model loading on shared GPU resources.
- Static VRAM allocation can be inefficient for variable workloads.
Method
Implement sequential booting with Docker health checks, then transition to dynamic VRAM virtualization using kvcached. Abstract models behind a unified proxy like LiteLLM with Gunicorn for production hardening.
In practice
- Use `kvcached` for dynamic GPU memory management.
- Employ LiteLLM as a unified API gateway.
- Configure Gunicorn with `MAX_REQUESTS_BEFORE_RESTART`.
Topics
- LLM Deployment
- GPU Memory Management
- kvcached
- vLLM
- LiteLLM
Code references
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.