NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)
Summary
NVIDIA GTC will feature sessions with Kyle Kranen, an architect of NVIDIA Dynamo, and Nader Khalil of NVIDIA Brev, who discuss the company's rapid growth and strategic investments in AI. Dynamo is presented as a data-center-scale inference engine that optimizes serving through prefill/decode disaggregation, scheduling, and Kubernetes-based orchestration, addressing tradeoffs among cost, latency, and quality. Khalil highlights Brev's role in simplifying GPU access and NVIDIA's broader focus on developer experience, including the "SOL" (Speed of Light) urgency concept. The discussion also covers agent security models, the "system as model" paradigm for complex AI workflows, and the importance of model/hardware co-design for managing long context lengths and supporting future AI advancements. The guests also explore the potential for "unhobblers" (scientific discoveries that drastically improve scaling) to overcome current limits on context length.
Key takeaway
NVIDIA Dynamo optimizes data-center-scale LLM inference by disaggregating the prefill and decode phases and using Kubernetes to scale the two phases separately, which helps manage cost, quality, and latency tradeoffs. This allows resources to be allocated efficiently between compute-bound prefill and memory-bound decode. Hardware/model co-design is exemplified by Kimi-2's 128K context fitting in just 8GB VRAM. AI/ML professionals can use Dynamo for high-performance LLM and agent deployments, complemented by NVIDIA Brev's simplified GPU access and secure agent permission models.
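Why prefill tends to be compute-bound and decode memory-bound can be illustrated with a rough roofline-style estimate. The sketch below is not Dynamo code and uses no Dynamo API. It assumes a dense transformer that performs about 2 FLOPs per parameter per token and reads every weight once per forward pass, ignoring KV-cache and activation traffic. The `Accelerator` class and its figures are hypothetical and do not describe any real GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator; both figures are illustrative assumptions."""
    peak_flops: float     # FLOP/s
    mem_bandwidth: float  # bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOPs/byte) above which work is compute-bound."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Assumed model: ~2 FLOPs per parameter per token, every weight read once
    per pass, so the parameter count cancels out of the ratio.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def classify(tokens_per_pass: int, hw: Accelerator) -> str:
    """Label a forward pass by comparing its intensity with the ridge point."""
    if arithmetic_intensity(tokens_per_pass) >= hw.ridge_point:
        return "compute-bound"
    return "memory-bound"


# Assumed hardware: 1 PFLOP/s and 2 TB/s, giving a ridge point of 500 FLOPs/byte.
hw = Accelerator(peak_flops=1.0e15, mem_bandwidth=2.0e12)

prefill = classify(tokens_per_pass=4096, hw=hw)  # whole prompt in one pass
decode = classify(tokens_per_pass=8, hw=hw)      # 8 sequences, one new token each

print(prefill, decode)
```

Under these assumptions, a prefill pass over a 4096-token prompt lands far above the ridge point, while a decode step over a small batch lands far below it. This gap is the usual motivation for sizing and scaling the two phases on separate worker pools.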
Topics
- AI Inference Optimization
- NVIDIA Dynamo
- Agent-based AI Systems
- GPU Developer Tools
- Model-Hardware Co-design
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent Space: The AI Engineer Podcast.