Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

2026-02-18 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, MLOps & AI Infrastructure Optimization · Depth: Advanced, long

Summary

NVIDIA and AI cloud provider Nebius benchmarked NVIDIA Run:ai's fractional GPU allocation for large language model (LLM) inference. The tests, run on NVIDIA H100 NVL and HGX B200 GPUs using NVIDIA NIM microservices, showed that fractional GPUs increase effective capacity without compromising latency. Key findings: a 0.5 GPU fraction delivered 77% of full-GPU throughput and 86% of full-GPU concurrent user capacity, and 0.25 GPU fractions supported up to 2x more concurrent users for smaller models. The system also supported up to 3x more total users when running mixed workloads (chat, reasoning, embeddings) on shared GPUs, with near-linear throughput scaling and production-ready autoscaling. The results support fractional GPU scheduling as a foundational capability for efficient, large-scale, multimodel LLM inference in production, both on-premises and in the cloud.
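
The capacity arithmetic behind these figures can be sketched as follows. This is a rough illustration, not part of the benchmark. It assumes the 77% and 86% figures are measured per 0.5 fraction relative to a full GPU, and that placing two such fractions on one GPU causes no further interference. The function name `aggregate_ratio` is hypothetical.

```python
import math


def aggregate_ratio(fraction: float, per_fraction_ratio: float) -> float:
    """Estimate per-GPU aggregate capacity relative to one full-GPU deployment.

    fraction: GPU share given to each replica (for example 0.5).
    per_fraction_ratio: capacity of one replica relative to a full GPU.
    Assumption: replicas sharing a GPU do not slow each other down further.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Number of whole replicas that fit on one GPU at this fraction.
    replicas_per_gpu = math.floor(1 / fraction + 1e-9)
    return replicas_per_gpu * per_fraction_ratio


# Ratios taken from the summary above; the aggregation itself is an estimate.
throughput_gain = aggregate_ratio(0.5, 0.77)
user_gain = aggregate_ratio(0.5, 0.86)
print(f"throughput: {throughput_gain:.2f}x, users: {user_gain:.2f}x")
```

Under those assumptions, two 0.5 fractions on one GPU would give an estimated 1.54x the throughput and 1.72x the concurrent users of a single full-GPU deployment. These are derived estimates, not measured results.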

Key takeaway

For Machine Learning Engineers and CTOs managing LLM inference infrastructure, adopting NVIDIA Run:ai with fractional GPU allocation can significantly improve GPU utilization and user capacity. You can achieve up to 3x more total system users and maintain sub-second time to first token (TTFT) by dynamically sharing GPU resources, thereby reducing the need for proportional increases in hardware investment and enabling more elastic, cost-effective deployments.

Key insights

Fractional GPU allocation significantly boosts LLM inference capacity and efficiency without sacrificing latency.

Principles

Fractional GPUs enable higher concurrency.
Dynamic scheduling optimizes resource allocation.
Mixed workloads benefit from co-location.

Method

The benchmark simulated concurrent users with GenAI-Perf on NVIDIA H100 NVL and HGX B200 GPUs, comparing native Kubernetes scheduling against NVIDIA Run:ai at full, 0.5, 0.25, and mixed fractional GPU allocations.

In practice

Deploy NVIDIA Run:ai for LLM inference.
Use 0.5 or 0.25 GPU fractions for smaller models.
Co-locate diverse LLM workloads on shared GPUs.
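
To make co-location concrete, here is a minimal sketch of how fractional requests pack onto shared GPUs. It uses a plain first-fit-decreasing heuristic, which is an assumption chosen for illustration and not a description of the NVIDIA Run:ai scheduler. The workload names and fraction sizes are hypothetical.

```python
from fractions import Fraction


def pack_first_fit(requests):
    """Place (name, fraction) requests on GPUs using first-fit decreasing.

    Returns a list of GPUs, each a dict with the remaining free share
    and the names of the workloads placed on it.
    """
    gpus = []
    # Largest requests first, so small ones can fill the leftover gaps.
    for name, frac in sorted(requests, key=lambda r: r[1], reverse=True):
        if not 0 < frac <= 1:
            raise ValueError(f"{name}: fraction must be in (0, 1]")
        for gpu in gpus:
            if gpu["free"] >= frac:
                gpu["free"] -= frac
                gpu["workloads"].append(name)
                break
        else:
            # No existing GPU has room, so allocate a new one.
            gpus.append({"free": Fraction(1) - frac, "workloads": [name]})
    return gpus


# Hypothetical mixed workload: two larger models and three smaller ones.
requests = [
    ("chat", Fraction(1, 2)),
    ("reasoning", Fraction(1, 2)),
    ("embeddings-a", Fraction(1, 4)),
    ("embeddings-b", Fraction(1, 4)),
    ("small-chat", Fraction(1, 4)),
]
placement = pack_first_fit(requests)
for index, gpu in enumerate(placement):
    print(index, gpu["workloads"], "free:", gpu["free"])
```

In this toy example the five workloads fit on two GPUs, where whole-GPU allocation would need five. Whether a given model actually fits in a fraction depends on its memory and latency requirements, which the sketch does not model.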

Topics

NVIDIA Run:ai
GPU Fractioning
LLM Inference
Workload Scheduling
NVIDIA NIM Microservices

Code references

triton-inference-server/client

Best for: Machine Learning Engineer, NLP Engineer, CTO, MLOps Engineer, AI Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.