Unlocking High-Performance Inference for DeepSeek with NVFP4 on NVIDIA Blackwell

Source: Microsoft Foundry Blog articles · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering) · Depth: Advanced, medium

Summary

Microsoft and NVIDIA partnered to optimize single-node inference for the 690-billion-parameter DeepSeek-V3.2 Mixture-of-Experts (MoE) model on the NVIDIA Blackwell architecture. The experiments ran on a single NVIDIA GB200 node (2 Grace Blackwell superchips, 4 Blackwell GPUs) using NVIDIA's NVFP4 checkpoint for DeepSeek-V3.2 and NVIDIA TensorRT LLM. Compared with NVIDIA H200 GPUs, this configuration achieved up to 2.5x lower per-user latency and could serve up to 16x more users per GPU while holding a consistent latency target. The gains come from three layers working together: the hardware (GB200 NVL72), NVFP4-quantized model weights (which cut the memory footprint by about 1.7x, from 690 GB to 415 GB), and the TensorRT LLM inference runtime. This setup is now used to serve DeepSeek-V3.2 on Microsoft Foundry.
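The footprint figures in the summary can be checked with a little arithmetic. The sketch below is illustrative and not taken from the original article: it recomputes the compression ratio and estimates the share of the weights each GPU would hold if the checkpoint were split evenly across the node's four GPUs. The even split and the function names are assumptions, and KV cache, activations, and runtime overhead are ignored.

```python
def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """How many times smaller the quantized checkpoint is."""
    return original_gb / quantized_gb


def weights_per_gpu_gb(total_gb: float, num_gpus: int) -> float:
    """Weight share per GPU, assuming a perfectly even split (an assumption)."""
    return total_gb / num_gpus


ORIGINAL_GB = 690.0   # footprint before NVFP4, from the summary
NVFP4_GB = 415.0      # footprint after NVFP4, from the summary
GPUS_PER_NODE = 4     # one GB200 node as described in the summary

ratio = compression_ratio(ORIGINAL_GB, NVFP4_GB)
print(f"compression ratio: {ratio:.2f}x")
print(f"per-GPU weights before: {weights_per_gpu_gb(ORIGINAL_GB, GPUS_PER_NODE):.2f} GB")
print(f"per-GPU weights after:  {weights_per_gpu_gb(NVFP4_GB, GPUS_PER_NODE):.2f} GB")
```

The ratio works out to roughly 1.66, which the summary rounds to 1.7x.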

Key takeaway

AI engineers deploying large language models, especially MoE architectures, should consider NVIDIA Blackwell platforms with NVFP4 quantization and TensorRT LLM. Compared with H200, this combination can reduce per-user inference latency by up to 2.5x and increase user capacity per GPU by up to 16x, which bears directly on operational cost and service scalability. Evaluate these technologies for your next-generation LLM deployments.
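One rough way to reason about the capacity claim is to estimate how many GPUs a target user count would need at two different user densities. In the sketch below, the baseline of 8 concurrent users per H200 GPU is a made-up placeholder, and 16x is the best-case figure quoted above, so the result is an upper bound on savings under those assumptions rather than a prediction.

```python
import math


def gpus_needed(target_users: int, users_per_gpu: float) -> int:
    """Smallest whole number of GPUs that can serve target_users."""
    return math.ceil(target_users / users_per_gpu)


BASELINE_USERS_PER_GPU = 8   # hypothetical H200 density, not from the article
MAX_UPLIFT = 16              # "up to 16x" best case from the article
TARGET_USERS = 1000          # hypothetical concurrent-user target

h200_gpus = gpus_needed(TARGET_USERS, BASELINE_USERS_PER_GPU)
blackwell_gpus = gpus_needed(TARGET_USERS, BASELINE_USERS_PER_GPU * MAX_UPLIFT)

print(f"H200 GPUs needed:      {h200_gpus}")
print(f"Blackwell GPUs needed: {blackwell_gpus} (best case)")
```

Substituting a measured baseline and a measured uplift for your own latency target gives a more realistic estimate.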

Key insights

Blackwell GPUs combined with NVFP4 quantization and TensorRT LLM substantially improve MoE inference performance: up to 2.5x lower per-user latency and up to 16x more users per GPU than H200.

Principles

Method

High-performance inference for DeepSeek-V3.2 was achieved by co-optimizing three layers: hardware (NVIDIA GB200), model weights (NVFP4 quantization), and inference runtime (TensorRT LLM).
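To make the weight-quantization part of the method concrete, here is a minimal pure-Python sketch of block-scaled 4-bit quantization in the style of NVFP4. It assumes the publicly described format (4-bit E2M1 values with one shared scale per 16-element block) and simplifies it: the block scale is kept as an ordinary Python float rather than rounded to FP8, and the second-level per-tensor scale is omitted. The function names are hypothetical, and this is not the code used by TensorRT LLM or the recipe behind the actual checkpoint.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def quantize_block(block):
    """Return (scale, codes), where codes are signed E2M1 grid values."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if x >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block scale and its codes."""
    return [c * scale for c in codes]


def quantize_tensor(values):
    """Split a flat list into blocks and quantize each independently."""
    blocks = [values[i:i + BLOCK_SIZE] for i in range(0, len(values), BLOCK_SIZE)]
    return [quantize_block(b) for b in blocks]


# Deterministic toy "weights" in roughly [-2, 2].
weights = [((i * 37) % 101 - 50) / 25.0 for i in range(32)]
packed = quantize_tensor(weights)
restored = [v for scale, codes in packed for v in dequantize_block(scale, codes)]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks: {len(packed)}, max abs error: {max_err:.4f}")
```

In the real format the per-block scale is stored in 8 bits, so each weight costs about 4.5 bits once the scale is amortized over 16 values. Per-block scaling keeps the rounding error proportional to the local magnitude of the weights instead of the global maximum.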

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.