Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell
Summary
NVIDIA's latest AI infrastructure, the Blackwell platform and the upcoming Rubin platform, is built around extreme co-design for next-generation AI workloads, particularly real-time reasoning and Mixture-of-Experts (MoE) models such as DeepSeek-R1 and Kimi K2 Thinking. These models stress compute, memory, networking, and storage at the same time, which makes cost per token a critical performance metric. The GB200 NVL72 system, for instance, delivers up to 20x the performance of the H200; even at a 67% higher unit cost, that works out to roughly one-twelfth the cost per token. Blackwell extends co-design to full rack scale, unifying 72 GPUs with NVLink Switch chips and drawing on NVFP4, Dynamo, and TensorRT LLM. Rubin advances this further with six new chips, including the Vera CPU and Rubin GPU, optimized for AI reasoning at scale, with an emphasis on end-to-end system design for efficiency.
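The one-twelfth figure follows from the two numbers quoted in the summary. The sketch below checks that arithmetic under two assumptions that are not stated in the source: "performance" means sustained token throughput, and "unit cost" means cost per unit of time (for example, per GPU-hour). The function name is illustrative only.

```python
def relative_cost_per_token(perf_multiple: float, unit_cost_increase: float) -> float:
    """Cost per token of a new system relative to a baseline (baseline = 1.0).

    Assumes cost per token = cost per unit time / tokens per unit time,
    so the ratio is (1 + cost increase) / performance multiple.
    """
    if perf_multiple <= 0:
        raise ValueError("perf_multiple must be positive")
    return (1.0 + unit_cost_increase) / perf_multiple


# Figures quoted in the summary: up to 20x performance, 67% higher unit cost.
ratio = relative_cost_per_token(perf_multiple=20.0, unit_cost_increase=0.67)
print(f"relative cost per token: {ratio:.4f} (about 1/{1 / ratio:.0f})")
```

Because the 20x figure is an "up to" number, the one-twelfth ratio is a best case; a smaller realized speedup raises the ratio proportionally.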
Key takeaway
CTOs and VPs of Engineering evaluating next-generation AI infrastructure should prioritize end-to-end system co-design over isolated component performance. Focus on platforms such as NVIDIA's Blackwell and Rubin that optimize the entire stack, from silicon to networking, for the lowest cost per token in reasoning workloads. This approach is positioned to improve ROI and to support larger model deployments with high responsiveness.
Key insights
Extreme co-design across the entire data center stack is crucial for cost-effective, high-performance AI reasoning.
Principles
- Cost per token is a key metric for AI inference ROI.
- System design outweighs raw FLOPS per dollar.
- End-to-end optimization reduces AI operational costs.
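As a minimal sketch of the cost-per-token metric these principles refer to, the function below converts an hourly system cost and a measured throughput into cost per million tokens. The formula is the common definition (cost per hour divided by tokens per hour), not something spelled out in the source, and every number in the example is hypothetical and chosen only to show the calculation.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical systems, NOT measured data: the second costs more per hour
# but sustains far higher throughput.
baseline = cost_per_million_tokens(hourly_cost_usd=3.0, tokens_per_second=1_000)
integrated = cost_per_million_tokens(hourly_cost_usd=5.0, tokens_per_second=10_000)

print(f"baseline:   ${baseline:.3f} per 1M tokens")
print(f"integrated: ${integrated:.3f} per 1M tokens")
```

In this made-up comparison the more expensive system is cheaper per token, which is why delivered throughput on the real workload matters more than the price of the hardware alone.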
Method
Co-design the entire AI stack, from silicon to rack to network, including GPUs, CPUs, interconnects, and cooling, to optimize for cost per token.
In practice
- Evaluate systems in real-world environments.
- Prioritize integrated rack-scale solutions.
- Consider NVLink, Spectrum-6, and InfiniBand for interconnects.
Topics
- NVIDIA Blackwell
- Rubin AI Platform
- AI Data Center Co-design
- Mixture-of-Experts Models
- Cost Per Token Optimization
Best for: CTO, VP of Engineering/Data, AI Engineer, MLOps Engineer, AI Architect, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Blog.