Reliable LLM Inference at Scale
Summary
Databricks has built an inference platform for serving frontier LLMs, including open-source models such as Kimi and Qwen and proprietary ones such as OpenAI's models, Gemini, and Claude, and it processes over 125T tokens per month. The platform targets reliability and low latency at scale, which is hard to achieve under spiky demand and on high-bandwidth GPU systems that are inherently unreliable. Key architectural components include an inference runtime, an Axon router, an autoscaler, rate limiting, and a capacity management algorithm. Databricks introduced "model units" to quantify request cost, enabling cost-based load balancing with Dicer and autoscaling that saved over 80% of GPU costs compared to static provisioning. Runtime reliability improved in two ways: prioritized black-box health checks reduced false liveness probe failures to zero, and multimodal request handling was optimized by switching to Torchvision-based image processors and configuring OMP_NUM_THREADS, which boosted RPS per server by over 3x.
Key takeaway
For MLOps Engineers managing large-scale LLM deployments, Databricks' strategies for reliability and cost efficiency are worth studying. Consider implementing a "model unit" abstraction to quantify request costs, which enables more precise load balancing and autoscaling. Prioritize health checks in your inference stack to prevent silent hangs and cascading failures. For multimodal requests, use efficient image processors such as Torchvision-based ones and configure OMP_NUM_THREADS correctly to avoid CPU throttling and achieve significant performance gains.
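The OMP_NUM_THREADS advice can be illustrated with a minimal sketch. The summary does not say which value Databricks chose, so the even split of cores across worker processes below is an assumption, and the helper names `omp_threads_per_worker` and `configure_omp` are hypothetical.

```python
import os
from typing import Optional


def omp_threads_per_worker(total_cores: int, num_workers: int) -> int:
    """Split the available cores evenly across workers, at least 1 thread each."""
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    return max(1, total_cores // num_workers)


def configure_omp(num_workers: int, total_cores: Optional[int] = None) -> int:
    """Set OMP_NUM_THREADS for this process and return the chosen value.

    Assumption: an even split is a reasonable default. Call this before
    importing libraries that start an OpenMP runtime, since the variable
    is typically read once at initialization.
    """
    cores = total_cores if total_cores is not None else (os.cpu_count() or 1)
    threads = omp_threads_per_worker(cores, num_workers)
    os.environ["OMP_NUM_THREADS"] = str(threads)
    return threads


if __name__ == "__main__":
    # Example: 8 preprocessing workers sharing a 32-core CPU allocation.
    print(configure_omp(num_workers=8, total_cores=32))
```

In a container with a CPU limit, `os.cpu_count()` may report the host's cores rather than the limit, so passing the limit explicitly as `total_cores` is the safer choice in that setting.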
Key insights
Reliable LLM inference at scale requires sophisticated capacity management, cost-based routing, and robust runtime failure detection.
Principles
- LLM request cost is highly variable and hard to estimate a priori.
- High-bandwidth GPU systems are less reliable than classical CPU systems.
- Overprovisioning for LLM inference is cost-prohibitive due to compute constraints.
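Because overprovisioning is too expensive, capacity has to follow demand. The sketch below shows one simple way an autoscaler could size a deployment from demand measured in model units. It is not Databricks' algorithm: the function name, the target utilization, and the replica bounds are all assumptions chosen for illustration.

```python
import math


def desired_replicas(
    demand_mu_per_s: float,
    capacity_mu_per_replica: float,
    target_utilization: float = 0.5,
    min_replicas: int = 1,
    max_replicas: int = 64,
) -> int:
    """Return how many replicas are needed to serve the observed demand.

    demand_mu_per_s: observed load, in model units per second (hypothetical unit).
    capacity_mu_per_replica: model units per second one replica can sustain.
    target_utilization: fraction of capacity to use, leaving headroom for spikes.
    """
    if capacity_mu_per_replica <= 0:
        raise ValueError("capacity_mu_per_replica must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable = capacity_mu_per_replica * target_utilization
    needed = math.ceil(demand_mu_per_s / usable)
    # Clamp to the configured bounds so the fleet never scales to zero
    # or beyond the available GPU quota.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for demand in (0, 450, 460, 10_000):
        print(demand, desired_replicas(demand, capacity_mu_per_replica=100))
```

A static deployment would have to be sized for the highest demand it ever expects, while this calculation can be rerun as demand changes.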
Method
Quantify request cost in "model units" for capacity management. Implement cost-based load balancing and autoscaling. Ensure runtime reliability via prioritized health checks and optimized multimodal processing.
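A minimal sketch of the first two steps follows. The summary does not define how a model unit is computed or how Dicer routes requests, so everything here is an assumption: the per-token weights are made up, the output length is bounded by the request's maximum output tokens because the true cost is not known up front, and the balancer simply picks the replica with the fewest in-flight model units.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical weights: generated tokens are assumed to cost more than prompt tokens.
PREFILL_MU_PER_TOKEN = 1.0
DECODE_MU_PER_TOKEN = 4.0


def estimate_model_units(prompt_tokens: int, max_output_tokens: int) -> float:
    """Upper-bound cost estimate for a request, in (hypothetical) model units."""
    return prompt_tokens * PREFILL_MU_PER_TOKEN + max_output_tokens * DECODE_MU_PER_TOKEN


@dataclass
class Replica:
    name: str
    in_flight_mu: float = 0.0  # model units currently being served


class CostBasedBalancer:
    """Route each request to the replica with the least outstanding cost."""

    def __init__(self, replicas: Iterable[Replica]) -> None:
        self.replicas: List[Replica] = list(replicas)

    def route(self, cost_mu: float) -> Replica:
        target = min(self.replicas, key=lambda r: r.in_flight_mu)
        target.in_flight_mu += cost_mu
        return target

    def complete(self, replica: Replica, cost_mu: float) -> None:
        # Release the reserved cost once the request finishes.
        replica.in_flight_mu = max(0.0, replica.in_flight_mu - cost_mu)


if __name__ == "__main__":
    lb = CostBasedBalancer([Replica("a"), Replica("b")])
    big = estimate_model_units(prompt_tokens=8000, max_output_tokens=1000)
    small = estimate_model_units(prompt_tokens=100, max_output_tokens=50)
    print(lb.route(big).name, lb.route(small).name, lb.route(small).name)
```

Balancing by request count would send the second small request back to the replica holding the large one. Balancing by cost keeps small requests away from it until the load evens out.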
In practice
- Implement "model units" for granular capacity estimation.
- Use cost-based load balancing for LLM workloads.
- Prioritize health checks to prevent cascading failures.
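One reading of "prioritized" health checks is that a probe should not wait behind a backlog of inference requests, since a probe that times out in the queue looks like a dead server. The sketch below illustrates that idea with a small priority queue. The summary does not describe Databricks' implementation, so the class name and priority levels are hypothetical.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Lower number means served first (assumed convention for this sketch).
HEALTH_CHECK = 0
INFERENCE = 1


class PrioritizedQueue:
    """Request queue in which health checks jump ahead of inference work."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # A monotonically increasing counter keeps FIFO order within a priority.
        self._seq = itertools.count()

    def put(self, priority: int, item: Any) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), item))

    def get(self) -> Any:
        _, _, item = heapq.heappop(self._heap)
        return item

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    q = PrioritizedQueue()
    for i in range(3):
        q.put(INFERENCE, f"inference-{i}")
    q.put(HEALTH_CHECK, "liveness-probe")
    while len(q):
        print(q.get())
```

With this ordering, the probe's latency reflects whether the server can still do work rather than how long the backlog is, which is one way to avoid restarting a healthy but busy replica.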
Topics
- LLM Inference
- Scalability
- Databricks Platform
- Capacity Management
- Load Balancing
- Autoscaling
- Multimodal AI
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.