Running Multiple Models on One GPU with vLLM and GPU Memory Utilization

Source: Andrej Baranovskij · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, quick

Summary

The article describes how to control GPU memory allocation when serving large language model (LLM) inference with vLLM on Linux with an Nvidia GPU. The author, Andrej Baranovskij, runs Sparrow, an open-source solution for structured data extraction, on a 96 GB Nvidia GPU. Sparrow needs two models, Mistral Small 3.2 (24 billion parameters) and the smaller dots OCR model, and loading them one after the other between inference requests is slow, so both should stay in GPU memory at the same time. vLLM's `gpu_memory_utilization` parameter sets the fraction of GPU memory that each model's vLLM instance may use. In the author's setup, Mistral receives 70% and dots OCR 20%, which leaves 10% free. With those limits in place, both models are loaded once, cached, and kept in memory to serve inference requests.
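The split quoted above is simple arithmetic on the 96 GB card. A minimal sketch of that budget check follows; the `plan_budget` helper and the dictionary labels are hypothetical names chosen for illustration and are not part of vLLM or Sparrow.

```python
TOTAL_GB = 96.0  # GPU size quoted in the summary

# Fractions quoted in the summary; the keys are illustrative labels only.
fractions = {"mistral-small-3.2": 0.70, "dots-ocr": 0.20}


def plan_budget(total_gb, fractions):
    """Return per-model budgets in GB and the unreserved headroom in GB."""
    for name, frac in fractions.items():
        if not 0.0 < frac <= 1.0:
            raise ValueError(f"{name}: fraction must be in (0, 1], got {frac}")
    reserved = sum(fractions.values())
    if reserved > 1.0:
        raise ValueError(f"fractions add up to {reserved:.2f}, more than the whole GPU")
    # Round to one decimal to hide floating-point noise in the products.
    budgets = {name: round(total_gb * frac, 1) for name, frac in fractions.items()}
    headroom = round(total_gb * (1.0 - reserved), 1)
    return budgets, headroom


budgets, headroom = plan_budget(TOTAL_GB, fractions)
for name, gb in budgets.items():
    print(f"{name}: {gb} GB")
print(f"free: {headroom} GB")
```

On these numbers the Mistral instance may use about 67.2 GB, the dots OCR instance about 19.2 GB, and about 9.6 GB stays unreserved. If the weights are stored in 16-bit precision (an assumption; the summary does not state the precision), a 24-billion-parameter model needs roughly 48 GB for weights alone, and the rest of its budget is available for KV cache and activations.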

Key takeaway

For AI engineers serving several models on a single Nvidia GPU, vLLM's `gpu_memory_utilization` parameter is the setting to reach for. Give each model its own fraction of GPU memory so that both load once and stay cached together. Requests that alternate between models then no longer pay the model loading cost, which improves throughput and responsiveness without additional hardware.

Key insights

vLLM's `gpu_memory_utilization` parameter enables efficient multi-model inference on a single GPU by pre-allocating memory.

Principles

Method

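Set vLLM's `gpu_memory_utilization` parameter for each model to the fraction of GPU memory it may use (for example, 0.7 for one model and 0.2 for another), keeping the total below 1. Multiple models then reside in GPU memory simultaneously, and inference does not wait on model loading.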
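One way to apply this method is to start one vLLM server process per model, each with its own limit. The sketch below only builds and prints the two `vllm serve` commands and does not launch anything. The `serve_command` helper is hypothetical, and the Hugging Face model IDs and port numbers are assumptions, since the summary names the models but not how they were launched.

```python
import shlex


def serve_command(model: str, port: int, gpu_memory_utilization: float) -> list[str]:
    """Build (but do not run) a `vllm serve` command for one model."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


# Assumed model IDs and ports; only the 0.7 / 0.2 split comes from the summary.
commands = [
    serve_command("mistralai/Mistral-Small-3.2-24B-Instruct-2506", 8000, 0.7),
    serve_command("rednote-hilab/dots.ocr", 8001, 0.2),
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each printed command would run in its own process, for example in separate terminals or as separate services. Individual models may need further flags described on their model cards, which are not shown here. For offline use, vLLM's Python `LLM` class accepts the same setting as the `gpu_memory_utilization` keyword argument.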

Topics

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Andrej Baranovskij.