Introducing AI Runtime: Scalable, Serverless NVIDIA GPUs on Databricks for Training and Finetuning
Summary
Databricks has announced the Public Preview of AI Runtime (AIR), a new training stack designed to simplify on-demand distributed GPU training for advanced AI workloads. AIR provides serverless access to NVIDIA A10 and H100 GPUs directly within Databricks Notebooks, eliminating the need for cluster management and charging only for active GPU usage. It integrates with Databricks' orchestration suite, including Lakeflow Jobs and Declarative Automation Bundles (DABs), for production-ready GPU workloads. The runtime is optimized for distributed deep learning, bundling performance enhancements like RDMA and high-performance data loading, and comes with pre-installed dependencies and support for frameworks such as PyTorch, Ray, and Hugging Face Transformers. AIR also offers centralized governance and observability through MLflow and Unity Catalog, with current support for distributed training across 8x H100s in a single node.
Key takeaway
Databricks' AI Runtime targets AI Scientists and Research Scientists who struggle with GPU infrastructure and the complexity of distributed training. On-demand A10 and H100 GPUs, pre-optimized environments, and integrated orchestration tools let you focus on model development, cutting setup and debugging time from days to hours.
Key insights
AI Runtime simplifies distributed GPU training by offering serverless, on-demand access and optimized tools within Databricks.
Principles
- Focus on modeling, not infrastructure
- Pay-as-you-go GPU compute
- Integrate training with data pipelines
Method
Configure notebooks to run on A10 or H100 GPUs, orchestrate production jobs with Lakeflow Jobs, and train with the pre-installed distributed frameworks (PyTorch, Ray, Hugging Face Transformers) while using MLflow for observability.
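To make the distributed-training part of the method concrete, here is a minimal sketch in plain Python of how data-parallel training typically splits a dataset across the 8 GPUs of a single node. It mirrors the general idea behind rank-based samplers in frameworks such as PyTorch. It is not AI Runtime code: `shard_indices` is a hypothetical helper introduced only for illustration, and the dataset size of 20 is an arbitrary assumption.

```python
# Illustrative only: plain-Python sketch of rank-based data sharding.
# `shard_indices` is a hypothetical helper, not an AI Runtime or PyTorch API.


def shard_indices(num_samples: int, world_size: int, rank: int) -> list[int]:
    """Return the sample indices that the worker with this rank processes.

    Indices are padded by wrapping around so every rank gets the same count,
    then dealt out round-robin (rank, rank + world_size, ...).
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    per_rank = -(-num_samples // world_size)  # ceiling division
    total = per_rank * world_size
    padded = [i % num_samples for i in range(total)]
    return padded[rank:total:world_size]


WORLD_SIZE = 8  # assumption: one worker process per GPU on an 8-GPU node
NUM_SAMPLES = 20  # assumption: toy dataset size chosen for readability

shards = [shard_indices(NUM_SAMPLES, WORLD_SIZE, r) for r in range(WORLD_SIZE)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each worker receives the same number of samples, so all 8 GPUs finish an epoch in step. The cost is that a few samples are repeated when the dataset size is not a multiple of the worker count.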
In practice
- Train LLMs like MPT and DBRX
- Develop computer vision models
- Fine-tune LLMs for agentic tasks
Topics
- AI Runtime
- Distributed GPU Training
- LLM Fine-tuning
- MLOps
- NVIDIA GPUs
Best for: AI Scientist, Research Scientist, CTO, AI Researcher, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.