Extreme Co-Design for Efficient Tokenomics and AI at Scale
Summary
The evolution of AI toward real-time reasoning, particularly with mixture-of-experts (MoE) models like DeepSeek-R1 and Kimi K2 Thinking, necessitates "extreme co-design" across the entire data center stack. This approach integrates compute, memory, networking, storage, and software to optimize performance and economics, shifting the focus from raw FLOPS to cost per token. For instance, the NVIDIA GB200 NVL72 offers up to 20x the performance of the H200, which works out to roughly one-twelfth the cost per token despite a 67% higher price. NVIDIA's Blackwell architecture extends co-design to rack scale with 72 GPUs, NVLink Switch chips, NVFP4, Dynamo, and TensorRT LLM, while the upcoming Rubin architecture goes further by integrating six new chips, including the Vera CPU and Rubin GPU. Azure and CoreWeave exemplify this co-design, optimizing infrastructure from silicon to data center orchestration to achieve efficiency and responsiveness for large AI models.
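The one-twelfth figure can be checked from the two ratios quoted above. The sketch below assumes that cost per token scales as system price divided by token throughput, and that "20x the performance" means 20x the tokens served per unit time. The function name `relative_cost_per_token` is hypothetical, and the model ignores power, cooling, and utilization.

```python
def relative_cost_per_token(price_ratio: float, throughput_ratio: float) -> float:
    """Cost per token of a new system relative to a baseline system.

    Assumption (not from the source): cost per token is proportional to
    price / tokens served, with everything else held equal.
    """
    if price_ratio <= 0 or throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    return price_ratio / throughput_ratio


# Figures quoted in the summary: 67% higher price, up to 20x performance.
gb200_vs_h200 = relative_cost_per_token(price_ratio=1.67, throughput_ratio=20.0)

print(f"relative cost per token: {gb200_vs_h200:.4f}")
print(f"baseline cost is about {1 / gb200_vs_h200:.1f}x higher per token")
```

Under these assumptions the higher purchase price is outweighed by the throughput gain, which is why the summary treats cost per token, rather than price or FLOPS alone, as the deciding metric.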
Key takeaway
For AI/ML Directors evaluating infrastructure for large-scale reasoning AI, prioritize end-to-end system co-design over isolated component performance. Your decision should focus on platforms like NVIDIA's Blackwell or Rubin, which integrate hardware and software from silicon to rack, to achieve the lowest cost per token and ensure responsiveness for MoE models. Consider partners like Azure or CoreWeave that demonstrate this integrated approach.
Key insights
Extreme co-design across the entire AI stack is crucial for efficient, scalable real-time reasoning and MoE models.
Principles
- Cost per token is a key metric for AI inference.
- System design outweighs FLOPS per dollar for ROI.
- Co-design must span silicon to software.
Method
Extreme co-design involves engineering the entire AI stack—compute, memory, networking, storage, and software—as a single system across the data center to optimize cost per token and performance.
In practice
- Evaluate systems in real-world environments.
- Consider the GB200 NVL72 for roughly one-twelfth the cost per token of the H200.
- Utilize NVLink, NVFP4, TensorRT LLM.
Topics
- Extreme Co-Design
- Mixture-of-Experts
- Cost Per Token
- NVIDIA Blackwell
- NVIDIA Rubin
Best for: VP of Engineering/Data, Director of AI/ML, AI Product Manager, AI Architect, MLOps Engineer, CTO
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA.