NVIDIA DSX OS Delivers Open, Modular Software for Operating AI Factories at Scale

Source: NVIDIA Technical Blog · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

NVIDIA has released DSX OS, open-source, modular software designed to accelerate AI factory deployments and improve operational efficiency in multi-tenant AI factories. As part of the broader NVIDIA DSX platform, DSX OS integrates across the five-layer AI stack (energy, chips, infrastructure, models, applications) to optimize token production relative to power consumption.

Key components include DSX Exchange for standardized IT/OT communication, and DSX MaxLPS and DSX Flex for power and efficiency optimization (enabling up to 40% more GPUs at peak efficiency). NVIDIA Infra Controller (NICo) and NVIDIA AI Cluster Runtime (AICR) handle provisioning and lifecycle operations, while NVIDIA NVSentinel and NVIDIA Fleet Intelligence cover health monitoring and automated remediation. KAI Scheduler, NVIDIA Run:ai, NVIDIA Dynamo, NVIDIA Grove, and NVIDIA Cloud Functions (NVCF) provide intelligent workload scheduling and platform services. Together, these components aim to give AI factory operators faster time to revenue, better efficiency, and higher reliability and resiliency.

Key takeaway

For MLOps engineers and AI architects building and operating large-scale AI factories, NVIDIA DSX OS is an open-source toolkit of modular components that can speed up deployments and improve operational efficiency. Integrating DSX OS lets you optimize power utilization, reach up to 40% more GPU capacity within existing budgets, and automate fault remediation, which lowers the cost per token and improves system reliability. Consider starting with the components that address your most immediate needs, such as DSX Exchange for communication or NVIDIA Infra Controller for lifecycle management.
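The arithmetic behind a claim like "up to 40% more GPUs" can be sketched as follows. This is a minimal illustration, not a description of how DSX MaxLPS or DSX Flex work: it assumes the budget in question is a fixed facility power budget, and every number and function name below is hypothetical. It shows only that provisioning less power per GPU lets more GPUs fit under the same budget, and how tokens per joule relates throughput to power draw.

```python
# Illustrative arithmetic only. All figures are made-up assumptions,
# not measurements or DSX OS parameters.

def gpus_within_budget(budget_watts: int, watts_per_gpu: int) -> int:
    """Whole GPUs that fit under a fixed power budget."""
    if budget_watts < 0 or watts_per_gpu <= 0:
        raise ValueError("budget must be >= 0 and watts_per_gpu must be > 0")
    return budget_watts // watts_per_gpu


def relative_gain(baseline: float, improved: float) -> float:
    """Fractional increase of `improved` over `baseline` (0.4 means +40%)."""
    return (improved - baseline) / baseline


def tokens_per_joule(tokens_per_second: float, watts: float) -> float:
    """Energy efficiency: (tokens/s) / (J/s) = tokens per joule."""
    return tokens_per_second / watts


if __name__ == "__main__":
    budget = 1_400_000  # hypothetical 1.4 MW available for GPUs

    # Hypothetical: provision each GPU for worst-case draw vs. a lower
    # power level enforced by policy.
    worst_case = gpus_within_budget(budget, watts_per_gpu=1400)
    capped = gpus_within_budget(budget, watts_per_gpu=1000)

    print(worst_case, capped, relative_gain(worst_case, capped))
    print(tokens_per_joule(tokens_per_second=500.0, watts=1000.0))
```

Whether more GPUs at lower power actually yields more tokens depends on how throughput scales with the power level, which is why tokens per joule, rather than GPU count alone, is the quantity to track.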

Key insights

NVIDIA DSX OS provides open, modular software to optimize AI factory operations, enhancing efficiency and reliability at scale.

Method

DSX OS components provide co-designed capabilities: standardized communication over MQTT, dynamic power policy enforcement, API-driven bare-metal management, and Kubernetes-native GPU fault detection.
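The summary names MQTT as the standardized communication layer but gives no topic layout or message schema. As a sketch of the underlying mechanism only, the following implements MQTT's standard topic-filter wildcards (`+` matches exactly one level, `#` matches all remaining levels). The topic names are hypothetical and are not taken from DSX Exchange.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic name matches a subscription filter.

    Implements the standard wildcard rules: '+' matches exactly one
    level, '#' matches the parent and any number of child levels and
    must be the last level of the filter.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    # Topics starting with '$' (e.g. $SYS) are not matched by filters
    # whose first level is a wildcard.
    if topic.startswith("$") and filter_levels[0] in ("+", "#"):
        return False

    for i, level in enumerate(filter_levels):
        if level == "#":
            # Valid only as the final filter level; matches the rest.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False

    # Without '#', the filter and topic must have the same depth.
    return len(filter_levels) == len(topic_levels)


if __name__ == "__main__":
    # Hypothetical topic hierarchy: site / hall / subsystem / device / metric
    subscription = "factory/+/power/#"
    for name in (
        "factory/hall-a/power/rack-12/watts",
        "factory/hall-b/power/pdu-3/amps",
        "factory/hall-a/cooling/cdu-1/flow",
    ):
        print(name, topic_matches(subscription, name))
```

A shared topic hierarchy of this kind is what lets IT systems (schedulers, power policy engines) subscribe to OT telemetry without device-specific integrations; the actual hierarchy DSX Exchange uses should be taken from NVIDIA's documentation.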

Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Architect, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.