Extreme Co-Design for Efficient Tokenomics and AI at Scale

2026-02-12 · Source: NVIDIA · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Emerging Technologies & Innovation · Depth: Advanced, short

Summary

The evolution of AI toward real-time reasoning, particularly with mixture-of-experts (MoE) models like DeepSeek-R1 and Kimi K2 Thinking, necessitates "extreme co-design" across the entire data center stack. This approach integrates compute, memory, networking, storage, and software to optimize performance and economics, shifting the focus from raw FLOPS to cost per token. For instance, the NVIDIA GB200 NVL72 offers up to 20x the performance of the H200, which works out to roughly 1/12th the cost per token despite a 67% higher price. NVIDIA's Blackwell architecture extends co-design to rack scale with 72 GPUs, NVLink Switch chips, NVFP4, Dynamo, and TensorRT LLM, while the upcoming Rubin architecture goes further by integrating six new chips, including the Vera CPU and Rubin GPU. Azure and CoreWeave exemplify this co-design, optimizing infrastructure from silicon to data center orchestration to achieve efficiency and responsiveness for large AI models.
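The 1/12th figure follows from the two ratios quoted above. A minimal sketch of that arithmetic, assuming "price" means the cost of running the system for a fixed period and "performance" means tokens produced in that period (the function name is illustrative, not from the original article):

```python
def relative_cost_per_token(price_ratio: float, throughput_ratio: float) -> float:
    """Cost per token of a new system relative to a baseline.

    price_ratio:      new price / baseline price (1.67 for "67% higher")
    throughput_ratio: new tokens per unit time / baseline tokens per unit time
    """
    if price_ratio <= 0 or throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    return price_ratio / throughput_ratio


# Ratios taken from the summary: 67% higher price, up to 20x the performance.
ratio = relative_cost_per_token(price_ratio=1.67, throughput_ratio=20.0)
print(f"relative cost per token: {ratio:.4f} (about 1/{1 / ratio:.0f})")
```

With these inputs the ratio is about 0.08, or roughly one-twelfth. Because "up to 20x" is a best case, a lower measured speedup raises the ratio proportionally.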

Key takeaway

For AI/ML Directors evaluating infrastructure for large-scale reasoning AI, prioritize end-to-end system co-design over isolated component performance. Favor platforms such as NVIDIA's Blackwell or Rubin, which integrate hardware and software from silicon to rack, to achieve the lowest cost per token and keep MoE models responsive. Consider partners like Azure or CoreWeave that demonstrate this integrated approach.

Key insights

Extreme co-design across the entire AI stack is crucial for efficient, scalable real-time reasoning and MoE models.

Principles

Cost per token is a key metric for AI inference.
System design outweighs FLOPS per dollar for ROI.
Co-design must span silicon to software.

Method

Extreme co-design involves engineering the entire AI stack—compute, memory, networking, storage, and software—as a single system across the data center to optimize cost per token and performance.

In practice

Evaluate systems in real-world environments.
Consider the GB200 NVL72, cited at about 1/12th the cost per token of the H200.
Use NVLink, NVFP4, and TensorRT LLM.
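One way to act on the first point is to rank candidate systems by cost per token measured on your own workload rather than by spec-sheet FLOPS. The sketch below assumes you have an all-in hourly cost and a measured throughput for each system. The system names and numbers are placeholders, not benchmark results, and the `Measurement` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    hourly_cost_usd: float    # all-in cost of running the system for one hour
    tokens_per_second: float  # throughput measured on your own workload


def cost_per_million_tokens(m: Measurement) -> float:
    """Dollars spent per one million generated tokens."""
    tokens_per_hour = m.tokens_per_second * 3600
    return m.hourly_cost_usd / tokens_per_hour * 1_000_000


# Placeholder values for illustration only; substitute your own measurements.
candidates = [
    Measurement("system_a", hourly_cost_usd=10.0, tokens_per_second=1_000.0),
    Measurement("system_b", hourly_cost_usd=16.7, tokens_per_second=20_000.0),
]

# Cheapest cost per token first.
ranked = sorted(candidates, key=cost_per_million_tokens)
for m in ranked:
    print(f"{m.name}: ${cost_per_million_tokens(m):.2f} per million tokens")
```

In this made-up comparison the more expensive system ranks first because its throughput advantage outweighs its higher hourly cost, which is the trade-off the cost-per-token metric is meant to expose.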

Topics

Extreme Co-Design
Mixture-of-Experts
Cost Per Token
NVIDIA Blackwell
NVIDIA Rubin

Best for: VP of Engineering/Data, Director of AI/ML, AI Product Manager, AI Architect, MLOps Engineer, CTO

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA.