NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

· Source: Latent Space: The AI Engineer Podcast · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, extended

Summary

NVIDIA GTC will feature sessions with Kyle Kranen, an architect of NVIDIA Dynamo, and Nader Khalil of NVIDIA Brev, who discuss the company's rapid growth and strategic investments in AI. Dynamo is presented as a datacenter-scale inference engine that optimizes serving through prefill/decode disaggregation, scheduling, and Kubernetes-based orchestration, addressing the tradeoffs among cost, latency, and quality. Khalil highlights Brev's role in simplifying GPU access and NVIDIA's broader focus on developer experience, including the "SOL" (Speed of Light) concept of urgency. The discussion covers agent security models, the "system as model" paradigm for complex AI workflows, and the importance of model/hardware co-design for managing long context lengths and future AI advancements. The team also explores "unhobblers" (scientific discoveries that drastically improve scaling) as a way past current limits on context length.

Key takeaway

NVIDIA Dynamo optimizes datacenter-scale LLM inference by disaggregating the prefill and decode phases and using Kubernetes to scale each phase separately, balancing cost, quality, and latency. Resources can then be allocated independently to compute-bound prefill and memory-bound decode. Hardware/model co-design is exemplified by Kimi-2's 128K context fitting in just 8GB VRAM. AI/ML professionals can use Dynamo for high-performance LLM and agent deployments, complemented by NVIDIA Brev's simplified GPU access and secure agent permission models.
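
The compute-bound versus memory-bound distinction can be sketched with back-of-the-envelope arithmetic. The Python below is a rough illustration only, not Dynamo code: `ModelConfig`, the model shape, and the accelerator figures are all hypothetical, and the cost model is a deliberate simplification (about 2 FLOPs per parameter per token, weights read once per forward pass, standard attention with a full KV cache).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical transformer shape; not the configuration of any real model."""
    n_params: float       # total weight count
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for fp16/bf16


def arithmetic_intensity(tokens_per_pass: int, cfg: ModelConfig) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Simplification: ~2 FLOPs per parameter per token, weights read once
    per pass, activation and KV-cache traffic ignored.
    """
    flops = 2.0 * cfg.n_params * tokens_per_pass
    bytes_read = cfg.n_params * cfg.bytes_per_value
    return flops / bytes_read


def is_compute_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """Roofline test: compute-bound once intensity exceeds peak FLOPs / bandwidth."""
    return intensity > peak_flops / mem_bandwidth


def kv_cache_bytes(seq_len: int, cfg: ModelConfig) -> int:
    """KV-cache size for one sequence: keys and values for every layer."""
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * seq_len * cfg.bytes_per_value


# Hypothetical 8B-parameter model and accelerator (assumed numbers).
cfg = ModelConfig(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)
PEAK_FLOPS = 1e15      # assumed peak throughput, FLOP/s
MEM_BANDWIDTH = 2e12   # assumed memory bandwidth, bytes/s

prefill = arithmetic_intensity(4096, cfg)  # whole prompt in one pass
decode = arithmetic_intensity(1, cfg)      # one new token per pass

print(f"prefill: {prefill:.0f} FLOPs/byte, compute-bound={is_compute_bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"decode:  {decode:.0f} FLOPs/byte, compute-bound={is_compute_bound(decode, PEAK_FLOPS, MEM_BANDWIDTH)}")
print(f"KV cache at 128K tokens: {kv_cache_bytes(128 * 1024, cfg) / 2**30:.0f} GiB")
```

Under these assumed numbers, prefill sits well above the roofline ridge point while single-token decode sits far below it, which is the usual argument for scaling the two phases on separate pools. The same hypothetical shape needs 16 GiB of KV cache for one 128K-token sequence with standard attention, which gives a sense of why long contexts push toward model/hardware co-design.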

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Latent Space: The AI Engineer Podcast.