Unlocking High-Performance Inference for DeepSeek with NVFP4 on NVIDIA Blackwell
Summary
Microsoft and NVIDIA partnered to optimize single-node inference for the 690-billion-parameter DeepSeek-V3.2 Mixture-of-Experts (MoE) model on the NVIDIA Blackwell architecture. Experiments ran on a single NVIDIA GB200 node (2 Grace Blackwell superchips, 4 Blackwell GPUs) using NVIDIA's NVFP4 checkpoint for DeepSeek-V3.2 and NVIDIA TensorRT LLM. Compared to NVIDIA H200 GPUs, this configuration achieved up to 2.5x lower per-user latency and could serve up to 16x more users per GPU at the same latency target. The optimization spanned three layers: hardware (GB200 NVL72), NVFP4-quantized model weights (shrinking the memory footprint by about 1.7x, from 690 GB to 415 GB), and the TensorRT LLM inference runtime. This setup is now used to serve DeepSeek-V3.2 on Microsoft Foundry.
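The footprint figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the summary. It assumes 1 GB = 10^9 bytes and, for the per-GPU figure, that the weights are sharded evenly across the node's 4 GPUs; neither assumption is stated in the original article.

```python
# Figures quoted in the summary.
PARAMS = 690e9          # model parameters
BASELINE_GB = 690.0     # footprint before NVFP4 quantization
NVFP4_GB = 415.0        # footprint of the NVFP4 checkpoint
GPUS_PER_NODE = 4       # Blackwell GPUs in one GB200 node


def bits_per_param(size_gb: float, params: float) -> float:
    """Average storage bits per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 1e9 * 8 / params


reduction = BASELINE_GB / NVFP4_GB                    # footprint ratio
baseline_bits = bits_per_param(BASELINE_GB, PARAMS)   # baseline bits per weight
nvfp4_bits = bits_per_param(NVFP4_GB, PARAMS)         # NVFP4 checkpoint bits per weight
per_gpu_gb = NVFP4_GB / GPUS_PER_NODE                 # assumes an even split of weights

print(f"reduction: {reduction:.2f}x")
print(f"baseline: {baseline_bits:.2f} bits/param")
print(f"nvfp4 checkpoint: {nvfp4_bits:.2f} bits/param")
print(f"weights per GPU: {per_gpu_gb:.2f} GB")
```

The baseline works out to about one byte per parameter, which is consistent with an 8-bit starting checkpoint, and the NVFP4 checkpoint averages a little under 5 bits per parameter. That average sits above 4 bits plausibly because of scale factors and layers kept at higher precision, though the summary does not break this down.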
Key takeaway
For AI engineers deploying large language models, especially MoE architectures, consider migrating to NVIDIA Blackwell platforms with NVFP4 quantization and TensorRT LLM. In the reported experiments, this combination cut per-user latency by up to 2.5x and served up to 16x more users per GPU than H200 at the same latency target, which bears directly on operating cost and service scalability. Evaluate these technologies for your next-generation LLM deployments.
Key insights
Blackwell GPUs with NVFP4 quantization and TensorRT LLM dramatically boost MoE model inference performance.
Principles
- End-to-end optimization is crucial for large MoE models.
- Lower precision formats can preserve accuracy while improving efficiency.
Method
The team achieved high-performance inference for DeepSeek-V3.2 by co-optimizing three layers: hardware (NVIDIA GB200), model weights (NVFP4 quantization), and the inference runtime (TensorRT LLM).
In practice
- Use NVFP4 for Blackwell-native 4-bit floating point inference.
- Deploy TensorRT LLM for optimized LLM serving on NVIDIA GPUs.
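To make the NVFP4 bullet concrete, here is a minimal pure-Python sketch of block-scaled 4-bit floating-point quantization. It relies on two properties of the format: elements are E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and each block of 16 elements shares a scale factor. It is a simplification, not NVIDIA's implementation: the scale is kept as an ordinary Python float rather than rounded to FP8, the second-level per-tensor scale is omitted, and nothing is bit-packed. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # elements that share one scale factor


def quantize_block(block):
    """Return (scale, codes); each code is a signed E2M1 value."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_MAGNITUDES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and its E2M1 codes."""
    return [scale * c for c in codes]


def quantize(values):
    """Split a flat list into blocks and quantize each one independently."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]


weights = [i / 10 for i in range(-8, 8)]  # 16 illustrative values
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block gets its own scale, an outlier only degrades the precision of the 16 values that share its block, which is a large part of why a 4-bit format can stay close to the original weights.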
Topics
- NVIDIA Blackwell
- DeepSeek-V3.2
- LLM Inference Optimization
- NVFP4 Quantization
- TensorRT LLM
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.