The Hidden Challenges of Running AI at Scale in Production

Source: The Data Exchange · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering) · Depth: Intermediate, extended

Summary

Chen Goldberg, EVP of Engineering at CoreWeave, discusses the infrastructure challenges and shifts involved in moving AI from pilot to production. She argues that specialized AI clouds, like CoreWeave, are better suited than general-purpose hyperscalers to optimizing "Goodput", the actual time GPUs spend on useful work. The conversation covers the complexities of the current GPU and memory supply chain, the emergence of reinforcement learning in enterprise settings beyond Silicon Valley, and the evolving nature of engineering roles in the AI era. CoreWeave's "Arena" platform is introduced as a benchmarking tool that runs on real infrastructure, letting customers stress-test workloads and tune configurations across storage, networking, and data flow without relying on simulations or guesswork. The discussion also touches on the growing adoption of AI in sectors such as finance, retail, and healthcare for tasks like risk assessment and differentiated customer experiences.
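
As a rough illustration of the "Goodput" idea, the sketch below treats it as the share of allocated GPU time that went to useful work. The summary does not give a formula, so this ratio, the `JobWindow` record, and the `goodput_ratio` helper are assumptions made for illustration, not CoreWeave's definition or tooling.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JobWindow:
    """Hypothetical accounting record for one stretch of a GPU job."""
    allocated_gpu_hours: float  # GPU time reserved and paid for
    useful_gpu_hours: float     # GPU time spent on productive work


def goodput_ratio(windows: Iterable[JobWindow]) -> float:
    """Return useful GPU time divided by allocated GPU time (assumed metric)."""
    windows = list(windows)
    allocated = sum(w.allocated_gpu_hours for w in windows)
    if allocated <= 0:
        raise ValueError("allocated GPU hours must be positive")
    useful = sum(w.useful_gpu_hours for w in windows)
    return useful / allocated


# Example: two 100-hour windows, one losing 20 hours and one losing 40 hours
# to stalls, restarts, or idle time.
ratio = goodput_ratio([JobWindow(100.0, 80.0), JobWindow(100.0, 60.0)])
print(f"goodput: {ratio:.0%}")
```

Under this assumed definition, the gap between the ratio and 100% is the GPU time paid for but lost to failures, restarts, or waiting on storage and networking.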

Key takeaway

For CTOs and VPs of Engineering scaling AI initiatives, relying solely on general-purpose hyperscalers for production AI workloads risks significant cost inefficiencies and performance bottlenecks. You should evaluate specialized AI cloud providers and adopt a "best-of-breed" approach to infrastructure, leveraging tools that offer real-world benchmarking and transparency into system performance. Prioritize solutions that optimize for "Goodput" and provide comprehensive telemetry to accelerate troubleshooting and ensure reliability, especially as agentic workflows and complex model training become more prevalent.

Key insights

Specialized AI clouds and infrastructure optimization are crucial for scaling AI from pilots to production workloads.

Method

CoreWeave's "Arena" provides real-infrastructure benchmarking to stress-test AI workloads, allowing customers to optimize configurations across compute, storage, and networking for production-grade performance and reliability.

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by The Data Exchange.