Reliable LLM Inference at Scale

· Source: Databricks · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, medium

Summary

Databricks has built an inference platform for serving frontier LLMs, both open-source models such as Kimi and Qwen and proprietary ones such as the OpenAI models, Gemini, and Claude. It processes over 125T tokens per month. The platform targets reliability and low latency at scale despite spiky demand and the inherent unreliability of high-bandwidth GPU systems. Its key architectural components are an inference runtime, an Axon router, an autoscaler, rate limiting, and a capacity management algorithm. Databricks introduced "model units" to quantify the cost of a request, which enables cost-based load balancing with Dicer and autoscaling that saved over 80% of GPU costs compared to static provisioning. Runtime reliability improved in two ways: prioritized black-box health checks reduced false liveness probe failures to zero, and multimodal request handling was optimized by switching to Torchvision-based image processors and configuring OMP_NUM_THREADS, which boosted RPS (requests per second) per server by over 3x.
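
The summary does not say how a model unit is computed, so the following is only a minimal sketch of cost-based load balancing under stated assumptions. The token weights, the `Replica` class, and the `pick_replica` helper are all hypothetical and are not taken from Databricks, Dicer, or Axon. The idea shown is to route each request to the replica with the least outstanding cost instead of the fewest outstanding requests.

```python
from dataclasses import dataclass

# Hypothetical weights: the source does not publish the model-unit formula.
PREFILL_WEIGHT = 1.0  # assumed cost per input token
DECODE_WEIGHT = 4.0   # assumed cost per generated token


def model_units(input_tokens: int, max_output_tokens: int) -> float:
    """Estimate a request's cost in (hypothetical) model units."""
    return PREFILL_WEIGHT * input_tokens + DECODE_WEIGHT * max_output_tokens


@dataclass
class Replica:
    name: str
    in_flight_units: float = 0.0  # total cost of requests still running


def pick_replica(replicas: list[Replica], cost: float) -> Replica:
    """Choose the replica with the least outstanding cost and book the request."""
    target = min(replicas, key=lambda r: r.in_flight_units)
    target.in_flight_units += cost
    return target


def release(replica: Replica, cost: float) -> None:
    """Return the booked cost once the request has finished."""
    replica.in_flight_units -= cost


if __name__ == "__main__":
    pool = [Replica("a"), Replica("b")]
    # One long-context request followed by two short ones.
    for tokens_in, tokens_out in [(8000, 100), (100, 50), (200, 100)]:
        cost = model_units(tokens_in, tokens_out)
        print(pick_replica(pool, cost).name, cost)
```

In this sketch both short requests go to replica "b", because replica "a" is already carrying the long-context request. A balancer that only counts requests would have sent the third request to "a".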

Key takeaway

MLOps engineers managing large-scale LLM deployments can apply Databricks' strategies for reliability and cost efficiency to their own stacks. Consider a "model unit" abstraction that quantifies the cost of each request, so that load balancing and autoscaling can be more precise. Prioritize health checks in your inference stack so that silent hangs are detected and do not turn into cascading failures. For multimodal requests, use an efficient image processor such as a Torchvision-based one and configure OMP_NUM_THREADS correctly to avoid CPU throttling; Databricks reports that these changes boosted RPS per server by over 3x.
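
As an illustration of the OMP_NUM_THREADS advice, here is a minimal sketch that assumes the goal is to keep the total OpenMP thread count within the container's CPU limit. The source does not state which value Databricks chose. Dividing the CPU limit by the number of worker processes is a common heuristic, not the article's method, and both helper names are hypothetical. The CPU limit is passed in explicitly because `os.cpu_count()` reports the machine's cores and does not reflect a container CPU quota.

```python
import os


def omp_threads_per_worker(cpu_limit: int, workers: int) -> int:
    """Split the available CPUs evenly across workers, with at least one thread each."""
    if cpu_limit < 1 or workers < 1:
        raise ValueError("cpu_limit and workers must both be >= 1")
    return max(1, cpu_limit // workers)


def configure_omp(cpu_limit: int, workers: int) -> int:
    """Set OMP_NUM_THREADS for this process and return the chosen value.

    Call this before importing libraries that start an OpenMP thread pool,
    because the variable is normally read once at initialization.
    """
    threads = omp_threads_per_worker(cpu_limit, workers)
    os.environ["OMP_NUM_THREADS"] = str(threads)
    return threads


if __name__ == "__main__":
    # Hypothetical deployment: a container limited to 16 CPUs running 4 preprocessing workers.
    print(configure_omp(cpu_limit=16, workers=4))
```

Without a setting like this, each worker may start one OpenMP thread per host core, so several workers together can oversubscribe the CPUs the container is actually allowed to use.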

Key insights

Reliable LLM inference at scale requires sophisticated capacity management, cost-based routing, and robust runtime failure detection.
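
The summary says only that health checks are "prioritized" and black-box; it does not describe the mechanism. One plausible reading, sketched here as an assumption, is that a probe travels through the same request queue as user traffic but is dequeued first, so a busy server is not mistaken for a dead one. The class names and priority values are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field

HEALTH_PRIORITY = 0  # assumed: probes are served before user traffic
USER_PRIORITY = 1


@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int  # tie-breaker that keeps FIFO order within a priority level
    payload: str = field(compare=False)


class PriorityRequestQueue:
    """Request queue in which health probes jump ahead of queued user requests."""

    def __init__(self) -> None:
        self._heap: list[QueuedRequest] = []
        self._seq = itertools.count()

    def submit(self, payload: str, is_health_check: bool = False) -> None:
        priority = HEALTH_PRIORITY if is_health_check else USER_PRIORITY
        heapq.heappush(self._heap, QueuedRequest(priority, next(self._seq), payload))

    def next_request(self) -> str:
        return heapq.heappop(self._heap).payload


if __name__ == "__main__":
    queue = PriorityRequestQueue()
    queue.submit("user-1")
    queue.submit("user-2")
    queue.submit("health-probe", is_health_check=True)
    print([queue.next_request() for _ in range(3)])
```

Because the probe still passes through the queue and the serving path, it remains a black-box check: it fails when the runtime hangs, but not merely because user requests are waiting ahead of it.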

Method

Model request cost with "model units" for capacity management. Implement cost-based load balancing and autoscaling. Ensure runtime reliability via prioritized health checks and optimized multimodal processing.
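
To make the autoscaling step concrete, the sketch below sizes a replica pool from demand measured in model units. It assumes each replica can sustain a known number of model units per second and that the pool should run below full utilization to absorb spikes. The function name, the 0.7 utilization target, and the replica bounds are illustrative assumptions, not values from the Databricks article.

```python
import math


def desired_replicas(
    demand_units_per_s: float,
    capacity_units_per_replica: float,
    target_utilization: float = 0.7,  # assumed headroom for spiky demand
    min_replicas: int = 1,
    max_replicas: int = 64,
) -> int:
    """Return the replica count that serves the demand at the target utilization."""
    if capacity_units_per_replica <= 0:
        raise ValueError("capacity_units_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_capacity = capacity_units_per_replica * target_utilization
    needed = math.ceil(demand_units_per_s / usable_capacity)
    # Clamp to the configured bounds so the pool never scales to zero or without limit.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Hypothetical numbers: each replica sustains 2,000 model units per second.
    for demand in (500, 10_000, 1_000_000):
        print(demand, desired_replicas(demand, capacity_units_per_replica=2_000))
```

Scaling on model units rather than on requests per second means one long-context request and many short ones are weighed by the load they actually place on the GPUs.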

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.