Hosting Qwen on Blackwell - Perplexity
Summary
Perplexity has deployed and is serving post-trained Qwen3 235B models on NVIDIA GB200 NVL72 racks, using Blackwell GPUs and a 72-way NVLink interconnect to handle significant traffic at reduced cost. Each rack comprises 18 nodes, each with 2 NVIDIA Grace CPUs and 4 Blackwell GPUs, for a total of 72 GPUs connected through NVLink and 18 NVLink Switch ASICs, with 1800 GB/s of NVLink bandwidth per GPU. Perplexity adapted its in-house inference engine, TransferEngine, to disaggregate prefill and decode operations, using InfiniBand for prefiller-to-decoder communication and NVLink for communication within each stage. This setup allows each stage to use its own parallelism strategy: tensor parallelism combined with expert parallelism (TP=4, EP=4) for compute-intensive prefill, and data/expert parallelism (EP=16) for memory-bound decode. Blackwell shows kernel-by-kernel improvements over Hopper GPUs. The deployment uses MXFP8 quantization for Qwen3 235B, which improves performance while keeping accuracy similar to that of block scaling.
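The rack figures in the summary can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant names are made up for this example, and it assumes the 1800 GB/s figure is the NVLink bandwidth available to each GPU.

```python
# Rack-level arithmetic for a GB200 NVL72, using the figures quoted in the summary.
NODES = 18
GRACE_CPUS_PER_NODE = 2
GPUS_PER_NODE = 4
NVLINK_GB_PER_S_PER_GPU = 1800  # assumption: a per-GPU figure, not a rack total

total_gpus = NODES * GPUS_PER_NODE
total_cpus = NODES * GRACE_CPUS_PER_NODE
# Sum of per-GPU NVLink bandwidth across the rack, converted from GB/s to TB/s.
aggregate_nvlink_tb_per_s = total_gpus * NVLINK_GB_PER_S_PER_GPU / 1000

print(f"GPUs per rack: {total_gpus}")
print(f"Grace CPUs per rack: {total_cpus}")
print(f"Aggregate NVLink bandwidth: {aggregate_nvlink_tb_per_s:.1f} TB/s")
```

The GPU count matches the "72" in the NVL72 name, which is why a single rack can be treated as one NVLink domain.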
Key takeaway
For AI architects and ML engineers evaluating next-generation inference infrastructure, the deployment of Qwen3 235B on NVIDIA GB200 NVL72 racks demonstrates significant throughput and cost-efficiency gains. Consider Blackwell's large NVLink domains and hardware-accelerated features such as SHARP and MXFP8 quantization when scaling large Mixture-of-Experts (MoE) models, particularly when disaggregating the prefill and decode stages to improve resource utilization and reduce latency.
Key insights
Blackwell GPUs with NVLink enable efficient, cost-effective deployment of large MoE models like Qwen3 235B.
Principles
- Disaggregate prefill and decode for optimal parallelism.
- Utilize NVLink Switches for efficient all-reduce operations.
- Microscaling quantization improves Blackwell throughput.
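To make the microscaling principle concrete, here is a minimal numerical sketch of MXFP8-style quantization in plain Python. It assumes the OCP MX layout of 32-element blocks that share one power-of-two scale, with FP8 E4M3 elements. The function names are hypothetical, and the code only models the rounding behavior; it is not Perplexity's kernel code and says nothing about how Blackwell hardware implements the format.

```python
import math

BLOCK_SIZE = 32     # elements sharing one scale in the OCP MX formats
E4M3_MAX = 448.0    # largest finite FP8 E4M3 value
E4M3_EMAX = 8       # exponent of that value (448 = 1.75 * 2**8)
E4M3_EMIN = -6      # smallest normal exponent in E4M3
MANTISSA_BITS = 3


def floor_log2(a):
    """Exact floor(log2(a)) for a > 0, via frexp (a = m * 2**e, 0.5 <= m < 1)."""
    return math.frexp(a)[1] - 1


def round_to_e4m3(v):
    """Round v to the nearest value on the E4M3 grid, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), E4M3_MAX)
    e = max(floor_log2(a), E4M3_EMIN)   # subnormals reuse the minimum exponent
    step = 2.0 ** (e - MANTISSA_BITS)   # spacing of representable values near a
    q = min(round(a / step) * step, E4M3_MAX)
    return math.copysign(q, v)


def quantize_block(block):
    """Return (shared power-of-two scale, list of values on the E4M3 grid)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    shared_exp = floor_log2(amax) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # range of an 8-bit exponent-only scale
    scale = 2.0 ** shared_exp
    return scale, [round_to_e4m3(x / scale) for x in block]


def dequantize_block(scale, q):
    """Recover approximate original values from the scale and quantized elements."""
    return [scale * v for v in q]


block = [0.01 * i for i in range(BLOCK_SIZE)]
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
max_rel_err = max(abs(r - x) / abs(x) for r, x in zip(restored, block) if x != 0.0)
print(f"shared scale = 2**{floor_log2(scale)}, max relative error = {max_rel_err:.4f}")
```

Because the scale is a power of two, applying and removing it is exact in floating point, so the only error comes from rounding each element to three mantissa bits.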
Method
Deploy MoE models on GB200 NVL72 by disaggregating prefill (tensor parallel, TP=4, EP=4) from decode (data/expert parallel, EP=16). Use InfiniBand for inter-stage communication and NVLink with SHARP for intra-stage communication, and apply MXFP8 quantization.
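The two parallelism layouts can be compared with a small configuration sketch. The `StageConfig` class and its fields are hypothetical and exist only for this illustration. It assumes that the tensor-parallel and expert-parallel groups share the same GPUs, that data-parallel attention in decode corresponds to TP=1, and that each MoE layer of Qwen3 235B has 128 routed experts.

```python
from dataclasses import dataclass

NUM_EXPERTS = 128   # routed experts per MoE layer in Qwen3 235B
GPUS_PER_NODE = 4   # GB200 NVL72 node size


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical description of one serving stage (prefill or decode)."""
    name: str
    tp: int  # tensor-parallel degree for attention (1 means data parallel)
    ep: int  # expert-parallel degree for the MoE layers

    @property
    def gpus(self) -> int:
        # Assumption: TP and EP groups overlap on the same GPUs,
        # so the larger degree sets the size of one instance.
        return max(self.tp, self.ep)

    @property
    def nodes(self) -> int:
        # Ceiling division: nodes needed to host one instance.
        return -(-self.gpus // GPUS_PER_NODE)

    @property
    def experts_per_gpu(self) -> int:
        return NUM_EXPERTS // self.ep


prefill = StageConfig("prefill", tp=4, ep=4)
decode = StageConfig("decode", tp=1, ep=16)

for stage in (prefill, decode):
    print(f"{stage.name}: {stage.gpus} GPUs, {stage.nodes} node(s), "
          f"{stage.experts_per_gpu} experts per GPU")
```

Under these assumptions a prefill instance fits in one node, while a decode instance spans four nodes and spreads the experts more thinly, which is consistent with decode relying on the rack-wide NVLink domain.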
In practice
- Consider GB200 NVL72 for large MoE model inference.
- Implement disaggregated prefill/decode for LLM serving.
- Explore MXFP8 quantization for Blackwell deployments.
Topics
- NVIDIA Blackwell GPUs
- Qwen Models
- Mixture-of-Experts
- LLM Inference Optimization
- NVLink Interconnect
Best for: AI Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by perplexity.ai via Google News.