Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell

2026-02-12 · Source: NVIDIA Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Emerging Technologies & Innovation · Depth: Advanced, short

Summary

NVIDIA's latest AI infrastructure, including the Blackwell platform and the upcoming Rubin platform, applies extreme co-design to the demands of next-generation AI, particularly real-time reasoning and Mixture-of-Experts (MoE) models such as DeepSeek-R1 and Kimi K2 Thinking. These models stress compute, memory, networking, and storage at the same time, which makes cost per token a critical metric. NVIDIA's GB200 NVL72 system, for instance, offers up to 20x the performance of the H200; even at a 67% higher unit cost, that works out to roughly one-twelfth the cost per token. The Blackwell platform extends co-design to full rack scale, unifying 72 GPUs through NVLink Switch chips and drawing on the NVFP4 low-precision format, Dynamo, and TensorRT LLM. The Rubin platform goes further with six new chips, including the Vera CPU and Rubin GPU, optimized for AI reasoning at scale, with an emphasis on end-to-end system design for efficiency.
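The one-twelfth figure follows directly from the two ratios quoted above. A minimal sketch of the arithmetic, assuming "performance" means token throughput and "unit cost" means cost per unit of time for the system (the summary does not define either term); the function name is illustrative:

```python
def relative_cost_per_token(perf_ratio: float, unit_cost_ratio: float) -> float:
    """Cost per token of a new system relative to a baseline system.

    Cost per token = unit cost / throughput, so the ratio between two
    systems is (unit cost ratio) / (performance ratio).
    """
    if perf_ratio <= 0:
        raise ValueError("perf_ratio must be positive")
    return unit_cost_ratio / perf_ratio


# Figures quoted in the summary: up to 20x performance, 67% higher unit cost.
ratio = relative_cost_per_token(perf_ratio=20.0, unit_cost_ratio=1.67)
print(f"relative cost per token: {ratio:.4f} (about 1/{1 / ratio:.0f})")
```

Because the 20x figure is an "up to" value, the resulting ratio is a best case under these assumptions, not a guaranteed saving.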

Key takeaway

CTOs and VPs of Engineering evaluating next-generation AI infrastructure should prioritize end-to-end system co-design over isolated component performance. Focus on solutions such as NVIDIA's Blackwell and Rubin platforms, which optimize the entire stack from silicon to networking to achieve the lowest cost per token for reasoning AI. The article presents this approach as a way to improve ROI and to deploy larger models while keeping responsiveness high.

Key insights

Extreme co-design across the entire data center stack is crucial for cost-effective, high-performance AI reasoning.

Principles

Cost per token is a key metric for AI inference ROI.
System design outweighs raw FLOPS per dollar.
End-to-end optimization reduces AI operational costs.
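
The first two principles can be made concrete with a small calculation. The sketch below is illustrative only: the system names, hourly costs, peak compute figures, and throughputs are hypothetical placeholders, not numbers from the article. It shows how a system with better peak FLOPS per dollar can still lose on cost per million tokens when its delivered throughput is lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    cost_per_hour: float      # USD per hour (hypothetical)
    peak_pflops: float        # peak compute in PFLOPS (hypothetical)
    tokens_per_second: float  # measured end-to-end throughput (hypothetical)

    def pflops_per_dollar_hour(self) -> float:
        """Peak compute bought per dollar of hourly spend."""
        return self.peak_pflops / self.cost_per_hour

    def cost_per_million_tokens(self) -> float:
        """Hourly cost divided by tokens served per hour, scaled to 1M tokens."""
        tokens_per_hour = self.tokens_per_second * 3600
        return self.cost_per_hour / tokens_per_hour * 1_000_000


# Placeholder numbers chosen only to illustrate the principle.
systems = [
    System("A (isolated components)", cost_per_hour=10.0,
           peak_pflops=5.0, tokens_per_second=2_000),
    System("B (co-designed rack)", cost_per_hour=20.0,
           peak_pflops=8.0, tokens_per_second=10_000),
]

for s in systems:
    print(f"{s.name}: {s.pflops_per_dollar_hour():.2f} PFLOPS per $/hour, "
          f"${s.cost_per_million_tokens():.2f} per 1M tokens")
```

With these placeholder inputs, system A looks better on peak compute per dollar while system B is cheaper per token, which is why delivered throughput is the quantity to measure.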

Method

Co-design the entire AI stack, from silicon to rack to network, including GPUs, CPUs, interconnects, and cooling, to optimize for cost per token.

In practice

Evaluate systems in real-world environments.
Prioritize integrated rack-scale solutions.
Consider NVLink, Spectrum-6, and InfiniBand for interconnects.

Topics

NVIDIA Blackwell
Rubin AI Platform
AI Data Center Co-design
Mixture-of-Experts Models
Cost Per Token Optimization

Best for: CTO, VP of Engineering/Data, AI Engineer, MLOps Engineer, AI Architect, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Blog.