How I am Running Multiple LLMs on a Single H100 — Challenges and Resolution

Source: LLM on Medium · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering) · Depth: Intermediate, medium

Summary

This article describes a three-phase architectural evolution for running multiple Large Language Models (LLMs) on a single NVIDIA H100 80GB GPU in an intelligent document processing pipeline. Initially, loading models concurrently produced non-deterministic VRAM allocation, and models such as Qwen 2.5 14B and PaddleOCR-VL 1.5 crashed. The first fix was a Strict Sequential Booting strategy built on Docker health checks and static VRAM allocation (e.g., 50% for Qwen, 30% for Paddle). This was stable, but rigid and inefficient for "spiky" workloads. The architecture then moved to dynamic virtualization with the `kvcached` library, which provides a "liquid" memory pool from which models borrow VRAM as needed. Finally, a production-grade gateway was added: LiteLLM as a unified OpenAI-compatible proxy, run under Gunicorn with `MAX_REQUESTS_BEFORE_RESTART` set for long-term stability. The resulting setup reaches a resting state of ~58GB with multiple models loaded.
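The static split mentioned in the summary can be checked with simple arithmetic. The following is a minimal sketch, assuming the percentages apply to the full 80 GB of the card and ignoring driver and CUDA context overhead; the function name `static_budget_gb` and the model labels are illustrative, not taken from the original article.

```python
def static_budget_gb(total_gb: float, fractions: dict[str, float]) -> dict[str, float]:
    """Return the fixed VRAM budget per model plus the unallocated remainder."""
    if sum(fractions.values()) > 1.0:
        raise ValueError("fractions exceed 100% of the GPU")
    budget = {name: total_gb * frac for name, frac in fractions.items()}
    budget["unallocated"] = total_gb - sum(budget.values())
    return budget


# Assumed inputs: 80 GB card, 50% for Qwen and 30% for Paddle (from the summary).
budget = static_budget_gb(80.0, {"qwen": 0.50, "paddle": 0.30})
for name, gb in budget.items():
    print(f"{name}: {gb:.1f} GB")
```

Under these assumptions Qwen is capped at about 40 GB and Paddle at about 24 GB, leaving roughly 16 GB unassigned. A fixed cap of this kind cannot be exceeded even when the other model is idle, which is the rigidity the summary describes.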

Key takeaway

For AI Engineers deploying multiple LLMs on a single GPU, prioritize dynamic VRAM allocation over static partitioning to maximize hardware utilization and handle variable workloads efficiently. Start with sequential model booting gated by health checks to ensure stability, then integrate a virtual memory solution like `kvcached` to enable flexible resource sharing. For production, consider a unified proxy like LiteLLM running under Gunicorn to keep the deployment robust, scalable, and maintainable.
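To illustrate why a shared pool handles spiky workloads better than fixed partitions, here is a toy model of the two policies. It is a conceptual sketch only: it does not use or imitate the `kvcached` API, the class names are hypothetical, and the workload sizes (a 45 GB spike, a 5 GB idle footprint) are made-up numbers chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StaticPartition:
    """Each model has a hard cap that it can never exceed."""
    caps_gb: dict[str, float]
    used_gb: dict[str, float] = field(default_factory=dict)

    def request(self, model: str, gb: float) -> bool:
        used = self.used_gb.get(model, 0.0)
        if used + gb > self.caps_gb[model]:
            return False  # over this model's own cap, even if the GPU has room
        self.used_gb[model] = used + gb
        return True


@dataclass
class SharedPool:
    """All models draw from one budget; only the total is limited."""
    total_gb: float
    used_gb: dict[str, float] = field(default_factory=dict)

    def request(self, model: str, gb: float) -> bool:
        if sum(self.used_gb.values()) + gb > self.total_gb:
            return False  # the pool as a whole is exhausted
        self.used_gb[model] = self.used_gb.get(model, 0.0) + gb
        return True

    def release(self, model: str, gb: float) -> None:
        self.used_gb[model] = max(0.0, self.used_gb.get(model, 0.0) - gb)


# Same 64 GB total in both cases (40 GB + 24 GB); hypothetical workload sizes.
static = StaticPartition(caps_gb={"qwen": 40.0, "paddle": 24.0})
static_ok = static.request("qwen", 45.0)   # spike exceeds Qwen's fixed cap

pool = SharedPool(total_gb=64.0)
pool.request("paddle", 5.0)                # Paddle is mostly idle
pooled_ok = pool.request("qwen", 45.0)     # Qwen borrows the unused capacity

print(f"static partition accepts spike: {static_ok}")
print(f"shared pool accepts spike: {pooled_ok}")
```

With the same total budget, the static partition rejects the spike while the shared pool accepts it, because capacity the idle model is not using remains available to the busy one.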

Key insights

Dynamic VRAM allocation and sequential booting are crucial for efficient multi-LLM deployment on single GPUs.

Method

Implement sequential booting with Docker health checks, then transition to dynamic VRAM virtualization using kvcached. Abstract models behind a unified proxy like LiteLLM with Gunicorn for production hardening.
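The sequential-boot step can be sketched as a loop that starts one service, waits until its health check passes, and only then starts the next. This is a minimal sketch, not the author's implementation: the `start` callback, the service names, and the health URLs are hypothetical, and the original setup relies on Docker health checks rather than a Python script.

```python
import time
import urllib.request
from typing import Callable, Sequence, Tuple


def http_healthy(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False


def boot_sequentially(
    services: Sequence[Tuple[str, str]],
    start: Callable[[str], None],
    is_healthy: Callable[[str], bool] = http_healthy,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Start each (name, health_url) in order, blocking until it is healthy."""
    booted = []
    for name, health_url in services:
        start(name)  # hypothetical hook, e.g. a wrapper that launches a container
        deadline = time.monotonic() + timeout_s
        while not is_healthy(health_url):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} not healthy after {timeout_s}s")
            sleep(poll_s)
        booted.append(name)  # VRAM is claimed; safe to start the next model
    return booted
```

Because each model finishes claiming its memory before the next one starts, allocation order becomes deterministic. In Docker Compose the same ordering is commonly expressed with a `healthcheck` on each service and `depends_on` with `condition: service_healthy` on the service that must wait.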

Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by LLM on Medium.