Running Multiple Models on One GPU with vLLM and GPU Memory Utilization
Summary
This article describes how to control GPU memory allocation when running large language model (LLM) inference with the vLLM framework on Linux with NVIDIA GPUs. The author, Andrej Baranovskij, runs Sparrow, an open-source solution for structured data extraction, on a 96 GB NVIDIA GPU. The challenge was to keep two different models, Mistral Small 3.2 (24 billion parameters) and a smaller dots OCR model, in memory at the same time, so that consecutive inference requests would not wait on slow model loading. vLLM offers a `gpu_memory_utilization` parameter, which sets the fraction of GPU memory (a value between 0 and 1) that vLLM may claim for a given model. In this setup, 70% was allocated to Mistral and 20% to dots OCR, leaving 10% free. With these budgets fixed up front, both models could be loaded, cached, and kept in memory for inference serving.
Key takeaway
For AI engineers running multiple LLMs on a single NVIDIA GPU, vLLM's `gpu_memory_utilization` parameter makes it possible to give each model a fixed share of GPU memory, so that both are loaded and cached at the same time. Because neither model has to be reloaded between inference requests, model-loading time drops out of the request path, which improves responsiveness without requiring additional hardware.
Key insights
vLLM's `gpu_memory_utilization` parameter enables multi-model inference on a single GPU by fixing, in advance, the share of GPU memory each model may claim.
Principles
- Caching models in memory improves inference speed.
- Fixing each model's share of GPU memory up front helps avoid out-of-memory (OOM) errors when several models share one GPU.
Method
Set vLLM's `gpu_memory_utilization` parameter for each model to the fraction of GPU memory it may use (for example, 0.70 and 0.20), so that multiple models can reside in GPU memory simultaneously and serve requests without being reloaded.
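As a minimal sketch of this method, the Python snippet below builds one `vllm serve` command per model, each with its own `--gpu-memory-utilization` value, and refuses a plan whose fractions add up to more than the whole GPU. It only prints the commands and does not start any server. The model identifiers, the ports, and the helper names `build_serve_command` and `plan_commands` are assumptions made for illustration; the original article does not state the exact commands it used.

```python
import shlex

# (model identifier, fraction of GPU memory, port).
# The model identifiers and ports are assumptions for illustration only.
PLAN = [
    ("mistralai/Mistral-Small-3.2-24B-Instruct-2506", 0.70, 8000),
    ("rednote-hilab/dots.ocr", 0.20, 8001),
]


def build_serve_command(model: str, gpu_fraction: float, port: int) -> list[str]:
    """Return the argv for one vLLM server with a capped GPU memory share."""
    if not 0.0 < gpu_fraction <= 1.0:
        raise ValueError(f"gpu_fraction must be in (0, 1], got {gpu_fraction}")
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_fraction),
        "--port", str(port),
    ]


def plan_commands(plan: list[tuple[str, float, int]]) -> list[list[str]]:
    """Build one command per model, rejecting plans that oversubscribe the GPU."""
    total = sum(fraction for _, fraction, _ in plan)
    if total > 1.0:
        raise ValueError(f"fractions sum to {total:.2f}, which exceeds the GPU")
    return [build_serve_command(model, fraction, port) for model, fraction, port in plan]


if __name__ == "__main__":
    # Print the commands; each would be started as its own process.
    for command in plan_commands(PLAN):
        print(shlex.join(command))
```

Each printed command would be run as a separate process on the same GPU. Checking the sum before launching anything is a design choice of this sketch, not something the source describes.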
In practice
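- Allocate 70% of GPU memory (`gpu_memory_utilization` of 0.70) to the 24B-parameter Mistral model.
- Allocate 20% (`gpu_memory_utilization` of 0.20) to the smaller dots OCR model.
- Leave the remaining 10% free.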
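To make the split concrete, the short sketch below converts the fractions into gigabytes on the 96 GB GPU mentioned in the summary. It also gives a rough estimate of how much of the larger budget the model weights alone would take. That estimate assumes 16-bit weights (2 bytes per parameter), which the source does not state, and the function names are hypothetical.

```python
TOTAL_GB = 96.0  # GPU memory size given in the summary
FRACTIONS = {"mistral": 0.70, "dots_ocr": 0.20}  # split given in the summary


def memory_budget(total_gb: float, fractions: dict[str, float]) -> dict[str, float]:
    """Convert per-model fractions into gigabytes, plus what is left unallocated."""
    budget = {name: round(total_gb * f, 1) for name, f in fractions.items()}
    allocated = sum(total_gb * f for f in fractions.values())
    budget["unallocated"] = round(total_gb - allocated, 1)
    return budget


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough weight size: 1e9 parameters at N bytes each is N GB (decimal)."""
    return float(params_billion * bytes_per_param)


if __name__ == "__main__":
    budget = memory_budget(TOTAL_GB, FRACTIONS)
    print(budget)

    # Assumption: 16-bit weights. The source does not state the precision used.
    mistral_weights = weights_gb(24, 2)
    headroom = round(budget["mistral"] - mistral_weights, 1)
    print(f"Mistral weights ~{mistral_weights} GB, ~{headroom} GB left in its budget")
```

Under these assumptions the budgets come to 67.2 GB and 19.2 GB, with 9.6 GB unallocated. Whatever remains in a model's budget after its weights are loaded is what vLLM has available for activations and the KV cache, so a budget set too close to the weight size leaves little room for serving requests.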
Topics
- vLLM Framework
- GPU Memory Allocation
- Large Language Model Inference
- NVIDIA GPU
- Mistral Model
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Andrej Baranovskij.