Hosting Qwen on Blackwell - Perplexity

Source: perplexity.ai via Google News · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering) · Depth: Expert, long

Summary

Perplexity is serving post-trained Qwen3 235B models on NVIDIA GB200 NVL72 racks, using Blackwell GPUs and a 72-way NVLink domain to handle significant traffic at reduced cost. Each rack contains 18 nodes, each with 2 NVIDIA Grace CPUs and 4 Blackwell GPUs, for a total of 72 GPUs connected through NVLink and 18 NVLink Switch ASICs at 1800 GB/s of NVLink bandwidth per GPU. Perplexity adapted its in-house inference engine, TransferEngine, to disaggregate prefill and decode, using InfiniBand for prefiller-to-decoder communication and NVLink for communication within each stage. Disaggregation lets each stage use its own parallelism strategy: tensor and expert parallelism (TP=4, EP=4) for compute-intensive prefill, and data/expert parallelism (EP=16) for memory-bound decode. Blackwell shows kernel-by-kernel improvements over Hopper GPUs. The deployment uses MXFP8 quantization for Qwen3 235B, which improves performance while keeping accuracy similar to block scaling.
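The rack arithmetic in the summary can be checked with a short sketch. The class and field names below are illustrative, and the aggregate figure assumes the 1800 GB/s number is per-GPU NVLink bandwidth, as NVIDIA specifies for fifth-generation NVLink.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NVL72Rack:
    """Counts quoted in the summary for one GB200 NVL72 rack (names are illustrative)."""

    nodes: int = 18
    grace_cpus_per_node: int = 2
    gpus_per_node: int = 4
    nvlink_switch_asics: int = 18
    # Assumption: 1800 GB/s is the NVLink bandwidth available to each GPU.
    nvlink_gb_per_s_per_gpu: int = 1800

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node

    @property
    def total_grace_cpus(self) -> int:
        return self.nodes * self.grace_cpus_per_node

    @property
    def aggregate_nvlink_tb_per_s(self) -> float:
        # Sum of per-GPU bandwidth across the rack, converted from GB/s to TB/s.
        return self.total_gpus * self.nvlink_gb_per_s_per_gpu / 1000


rack = NVL72Rack()
print(rack.total_gpus)                 # 72
print(rack.total_grace_cpus)           # 36
print(rack.aggregate_nvlink_tb_per_s)  # 129.6
```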

Key takeaway

For AI Architects and ML Engineers evaluating next-generation inference infrastructure, Perplexity's deployment of Qwen3 235B on NVIDIA GB200 NVL72 racks demonstrates significant throughput and cost-efficiency gains. When scaling large Mixture-of-Experts (MoE) models, consider Blackwell's large NVLink domain and hardware-accelerated features such as SHARP and MXFP8 quantization, particularly if you disaggregate the prefill and decode stages to improve resource utilization and reduce latency.
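To make the MXFP8 idea concrete: in the OCP Microscaling (MX) formats, a block of 32 values shares one power-of-two scale and each value is stored as an 8-bit float. The following is a minimal pure-Python simulation of that scheme, assuming E4M3 elements (the summary does not say which FP8 element type is used). It illustrates the numerics only and is not Perplexity's kernel code; all function names are hypothetical.

```python
import math

BLOCK_SIZE = 32    # elements per shared scale in the MX formats
E4M3_MAX = 448.0   # largest finite E4M3 magnitude
E4M3_EMAX = 8      # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)
E4M3_EMIN = -6     # smallest normal exponent; below this, values are subnormal
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value representable in FP8 E4M3, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_EMIN)
    step = 2.0 ** (exp - MANTISSA_BITS)
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_block(values):
    """Return (shared_exponent, fp8_elements) for one block of up to 32 floats."""
    if len(values) > BLOCK_SIZE:
        raise ValueError("an MX block holds at most 32 elements")
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Shared power-of-two scale: align the block maximum with the top E4M3 binade.
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # range of the 8-bit E8M0 scale
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in values]


def dequantize_block(shared_exp, elements):
    """Recover approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


block = [3.0, -1.5, 0.25] + [0.0] * 29
exp, q = quantize_block(block)
print(exp, q[:3])                       # -7 [384.0, -192.0, 32.0]
print(dequantize_block(exp, q)[:3])     # [3.0, -1.5, 0.25]
```

Because the scale is a power of two, applying it only shifts exponents and costs no mantissa precision; the rounding error comes entirely from the 3-bit FP8 mantissa and from saturation at the top of the range.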

Key insights

Blackwell GPUs with NVLink enable efficient, cost-effective deployment of large MoE models like Qwen3 235B.

Method

Deploy MoE models on GB200 NVL72 by disaggregating prefill (tensor parallel, TP=4, EP=4) from decode (data/expert parallel, EP=16). Use InfiniBand for inter-stage communication and NVLink/SHARP for intra-stage communication, and apply MXFP8 quantization.
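The two parallelism layouts can be written down as a small configuration sketch. Several things here are assumptions rather than facts from the article: that the TP and EP groups of a stage share the same GPUs, that decode runs attention without tensor parallelism (TP=1), and that the model has 128 routed experts (the figure on the public Qwen3-235B-A22B model card, not stated in this summary). The names `StageConfig` and `replicas_that_fit` are hypothetical.

```python
import math
from dataclasses import dataclass

RACK_GPUS = 72       # GPUs in one GB200 NVL72 rack
GPUS_PER_NODE = 4
NUM_EXPERTS = 128    # assumption: routed expert count from the public model card


@dataclass(frozen=True)
class StageConfig:
    name: str
    tp: int          # tensor-parallel degree
    ep: int          # expert-parallel degree
    intra_stage_link: str

    @property
    def gpus_per_replica(self) -> int:
        # Assumption: TP and EP groups overlap on the same GPUs,
        # so one replica spans the larger of the two degrees.
        return max(self.tp, self.ep)

    @property
    def nodes_per_replica(self) -> int:
        return math.ceil(self.gpus_per_replica / GPUS_PER_NODE)

    def experts_per_gpu(self, num_experts: int = NUM_EXPERTS) -> int:
        if num_experts % self.ep != 0:
            raise ValueError("experts must divide evenly across the EP group")
        return num_experts // self.ep


PREFILL = StageConfig("prefill", tp=4, ep=4, intra_stage_link="NVLink")
DECODE = StageConfig("decode", tp=1, ep=16, intra_stage_link="NVLink")  # tp=1 is assumed
INTER_STAGE_LINK = "InfiniBand"  # prefiller-to-decoder transfers


def replicas_that_fit(stage: StageConfig, gpus: int = RACK_GPUS) -> int:
    """Upper bound on replicas of one stage if it had a whole rack to itself."""
    return gpus // stage.gpus_per_replica


for stage in (PREFILL, DECODE):
    print(stage.name, stage.gpus_per_replica, stage.nodes_per_replica,
          stage.experts_per_gpu(), replicas_that_fit(stage))
# prefill 4 1 32 18
# decode 16 4 8 4
```

Under these assumptions a prefill replica fits inside a single node, while a decode replica spans four nodes and therefore depends on the rack-wide NVLink domain for its expert-parallel traffic.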

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by perplexity.ai via Google News.