Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3

· Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cloud Computing & IT Infrastructure · Depth: Expert, long

Summary

NVIDIA has released Cosmos 3, a frontier foundation model for physical AI that unifies physical reasoning, world generation, and action generation in a single open model. The release includes the open Cosmos 3 models, training scripts, deployment tools, and six synthetic data generation datasets for applications such as robotics and autonomous driving. The model uses a Mixture-of-Transformers architecture and comes in two sizes: the 16B-parameter Cosmos 3 Nano for workstation-grade compute (NVIDIA RTX PRO 6000 GPU) and the 64B-parameter Cosmos 3 Super for datacenter deployment (NVIDIA Hopper and Blackwell GPUs). Cosmos 3 supports multiple input and output modalities and, according to NVIDIA, leads on benchmarks such as VANTAGE-Bench and PAI-Bench. NVIDIA also introduced the Cosmos Human Evaluation (HUE) framework for objective assessment of video generation quality. Training recipes cover Supervised Fine-Tuning and action post-training, and deployment is available through NVIDIA NIM microservices, which add NVFP4 quantization and vLLM support for efficient inference.
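To put the two model sizes and NVFP4 quantization in perspective, the following is a rough back-of-the-envelope sketch of weight storage only. It is not taken from the original article. It assumes a BF16 baseline (16 bits per weight) and models NVFP4 as 4-bit values plus one 8-bit scale shared by each block of 16 weights, which gives about 4.5 bits per weight. The helper name `weight_footprint_gb` is hypothetical, and the estimate ignores activations, caches, and any layers a real deployment keeps at higher precision.

```python
def weight_footprint_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: unquantized weights are stored in BF16.
BF16_BITS = 16.0
# Assumption: 4-bit values plus one 8-bit scale per block of 16 weights.
NVFP4_BITS = 4.0 + 8.0 / 16.0

# Parameter counts as stated in the summary.
MODELS = {
    "Cosmos 3 Nano": 16_000_000_000,
    "Cosmos 3 Super": 64_000_000_000,
}

for name, params in MODELS.items():
    bf16 = weight_footprint_gb(params, BF16_BITS)
    nvfp4 = weight_footprint_gb(params, NVFP4_BITS)
    print(f"{name}: {bf16:.0f} GB in BF16, about {nvfp4:.0f} GB in NVFP4")
```

Under these assumptions the weights alone come to 32 GB (BF16) versus about 9 GB (NVFP4) for Nano, and 128 GB versus about 36 GB for Super. Actual memory use in a NIM deployment will differ.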

Key takeaway

For machine learning engineers building physical AI systems, NVIDIA Cosmos 3 offers a unified, open foundation model that streamlines development. Use the 16B-parameter Nano model for efficient edge inference or the 64B-parameter Super model for high-quality datacenter workloads. Fine-tune on your own data with the provided Supervised Fine-Tuning recipes, and deploy the optimized models through NVIDIA NIM microservices.

Key insights

Cosmos 3 unifies physical AI reasoning, world generation, and action generation into a single, open foundation model.

Method

Cosmos 3 uses a Mixture-of-Transformers architecture with two towers: a VLM-based Reasoner tower for interpretation and a diffusion-based Generator tower that outputs physics-aware video and action sequences.
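The summary does not describe the layer internals, so the following is only a toy sketch of the general Mixture-of-Transformers idea, not the Cosmos 3 implementation. It assumes each token is routed to one of two hypothetical towers ("reasoner" or "generator"), that each tower owns its attention projections and feed-forward weights, and that self-attention runs jointly over all tokens. Layer normalization, multiple heads, masking, and the diffusion process itself are omitted, and all names are illustrative.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_tower(rng, d):
    """One tower's private weights: attention projections and a small FFN."""
    names = ("q", "k", "v", "o", "ffn_in", "ffn_out")
    return {n: [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d)]
            for n in names}


def mot_layer(tokens, tower_ids, towers):
    """Apply one simplified layer; tower_ids[i] picks the weights for token i."""
    d = len(tokens[0])
    # Step 1: project each token with the weights of its own tower.
    q = [matvec(towers[t]["q"], x) for x, t in zip(tokens, tower_ids)]
    k = [matvec(towers[t]["k"], x) for x, t in zip(tokens, tower_ids)]
    v = [matvec(towers[t]["v"], x) for x, t in zip(tokens, tower_ids)]
    out = []
    for i, (x, t) in enumerate(zip(tokens, tower_ids)):
        # Step 2: attention is global, so token i attends to tokens of both towers.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        mixed = [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)]
        # Step 3: output projection and FFN are tower-specific, with residuals.
        h = [a + b for a, b in zip(x, matvec(towers[t]["o"], mixed))]
        hidden = [max(0.0, z) for z in matvec(towers[t]["ffn_in"], h)]
        out.append([a + b for a, b in zip(h, matvec(towers[t]["ffn_out"], hidden))])
    return out


rng = random.Random(0)
towers = {"reasoner": make_tower(rng, 4), "generator": make_tower(rng, 4)}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
tower_ids = ["reasoner", "reasoner", "generator"]
outputs = mot_layer(tokens, tower_ids, towers)
print(len(outputs), len(outputs[0]))
```

The point of the sketch is the split of responsibilities: parameters are separate per tower, while the shared attention step lets generator tokens condition on what the reasoner tokens encode, and vice versa.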

Best for: AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.