Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai
Summary
NVIDIA and AI cloud provider Nebius benchmarked NVIDIA Run:ai's fractional GPU allocation for large language model (LLM) inference on NVIDIA H100 NVL and HGX B200 GPUs using NVIDIA NIM microservices. The tests showed that fractional GPUs increase effective capacity without compromising latency. With 0.5 GPU fractions, deployments reached 77% of full-GPU throughput and 86% of full-GPU concurrent user capacity. With 0.25 GPU fractions, smaller models supported up to 2x more concurrent users. Running mixed workloads (chat, reasoning, embeddings) on shared GPUs supported up to 3x more total users, with near-linear throughput scaling and production-ready autoscaling. The results position fractional GPU scheduling as a foundational capability for efficient, large-scale, multi-model LLM inference in production, both on-premises and in the cloud.
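To make the 0.5-fraction figures concrete, the sketch below works through the arithmetic. It rests on an assumption the summary does not state outright: that each 0.5 fraction delivers the reported share of full-GPU performance, and that two such fractions can sustain it on the same GPU at the same time. The function name `effective_capacity` is hypothetical and exists only for this illustration.

```python
def effective_capacity(fraction: float, relative_per_fraction: float) -> float:
    """Aggregate capacity of one GPU fully packed with equal fractions,
    relative to a single deployment that owns the whole GPU.

    ASSUMPTION: every fraction sustains `relative_per_fraction` concurrently.
    """
    slots = round(1 / fraction)  # e.g. two slots for 0.5, four for 0.25
    return slots * relative_per_fraction


# Figures quoted in the summary for 0.5 GPU fractions.
throughput_gain = effective_capacity(0.5, 0.77)
user_gain = effective_capacity(0.5, 0.86)

print(f"throughput per GPU: {throughput_gain:.2f}x")
print(f"concurrent users per GPU: {user_gain:.2f}x")
```

Under that assumption, a GPU split into two halves would serve roughly 1.5x the throughput and 1.7x the concurrent users of the same GPU dedicated to one deployment.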
Key takeaway
For machine learning engineers and CTOs managing LLM inference infrastructure, NVIDIA Run:ai with fractional GPU allocation can improve GPU utilization and user capacity. In the benchmarks, dynamically sharing GPU resources supported up to 3x more total system users while keeping time to first token (TTFT) under one second. This reduces the need to add hardware in proportion to demand and enables more elastic, cost-effective deployments.
Key insights
Fractional GPU allocation significantly boosts LLM inference capacity and efficiency without sacrificing latency.
Principles
- Fractional GPUs enable higher concurrency.
- Dynamic scheduling optimizes resource allocation.
- Mixed workloads benefit from co-location.
Method
The benchmarks simulated concurrent users with GenAI-Perf on NVIDIA H100 NVL and HGX B200 GPUs, comparing native Kubernetes scheduling against NVIDIA Run:ai at full, 0.5, 0.25, and mixed fractional GPU allocations.
In practice
- Deploy NVIDIA Run:ai for LLM inference.
- Utilize 0.5 or 0.25 GPU fractions for smaller models.
- Co-locate diverse LLM workloads on shared GPUs.
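As a rough illustration of co-locating workloads on shared GPUs, the sketch below packs fractional requests onto GPUs with a simple first-fit-decreasing heuristic. This is not the NVIDIA Run:ai scheduler's algorithm, which the source does not describe. The workload names, the fraction sizes assigned to them, and the function `pack_first_fit` are all hypothetical.

```python
from fractions import Fraction


def pack_first_fit(requests, gpu_count):
    """Place (name, fraction) requests on GPUs, largest fraction first.

    Returns one list of workload names per GPU. Raises ValueError if a
    request does not fit on any GPU. Illustrative heuristic only.
    """
    free = [Fraction(1)] * gpu_count            # remaining share of each GPU
    placement = [[] for _ in range(gpu_count)]
    for name, frac in sorted(requests, key=lambda r: r[1], reverse=True):
        frac = Fraction(frac)
        for i in range(gpu_count):
            if free[i] >= frac:                 # first GPU with enough room
                free[i] -= frac
                placement[i].append(name)
                break
        else:
            raise ValueError(f"no GPU has {frac} free for {name}")
    return placement


# Hypothetical mixed workload: sizes are assumptions, not benchmark settings.
workloads = [
    ("chat", Fraction(1, 2)),
    ("reasoning", Fraction(1, 2)),
    ("embeddings", Fraction(1, 4)),
    ("small-chat", Fraction(1, 4)),
]
placement = pack_first_fit(workloads, gpu_count=2)
for gpu_index, names in enumerate(placement):
    print(f"GPU {gpu_index}: {names}")
```

Exact fractions from the standard library avoid floating-point drift when many small shares are subtracted from a GPU's remaining capacity.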
Topics
- NVIDIA Run:ai
- GPU Fractioning
- LLM Inference
- Workload Scheduling
- NVIDIA NIM Microservices
Best for: Machine Learning Engineer, NLP Engineer, CTO, MLOps Engineer, AI Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.