How Superhuman and Databricks built a 200K QPS inference platform together
Summary
Superhuman, a productivity platform serving over 40 million daily users, partnered with Databricks to modernize its AI model serving stack for real-time communication assistance. Its custom large language model handles peak traffic exceeding 200,000 queries per second (QPS) with P99 latency under 1 second and 99.99% reliability. Migrating from a DIY vLLM-based stack to Databricks Model Serving addressed compounding pain points such as manual performance tuning for each new model iteration and a growing operational burden. The collaboration focused on three areas: load balancing with a power-of-two choices algorithm, faster container startup via lazy-loading image formats, and runtime optimizations such as FP8 quantization and multiprocessing. The runtime optimizations raised per-pod throughput on H100 GPUs by 60%, from 750 QPS to 1,200 QPS.
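The throughput figures above translate into a rough capacity estimate. The sketch below is a back-of-the-envelope calculation, not a number reported by the source: it assumes load is spread perfectly evenly across pods and that no headroom is reserved, so the results are lower bounds. The function name `pods_needed` is illustrative.

```python
import math

PEAK_QPS = 200_000   # peak traffic quoted in the summary (a floor, since traffic exceeds it)
QPS_BEFORE = 750     # per-pod throughput before the runtime optimizations
QPS_AFTER = 1_200    # per-pod throughput after the runtime optimizations


def pods_needed(peak_qps: int, per_pod_qps: int) -> int:
    """Minimum pods to absorb peak load.

    Assumption: perfectly even load distribution and zero spare headroom.
    """
    return math.ceil(peak_qps / per_pod_qps)


gain = (QPS_AFTER - QPS_BEFORE) / QPS_BEFORE
pods_before = pods_needed(PEAK_QPS, QPS_BEFORE)
pods_after = pods_needed(PEAK_QPS, QPS_AFTER)

print(f"throughput gain: {gain:.0%}")
print(f"pods at peak: {pods_before} before, {pods_after} after")
```

Under these assumptions, serving 200,000 QPS takes at least 267 pods at 750 QPS each and at least 167 pods at 1,200 QPS each. A real deployment would need more to cover uneven load and failover.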
Key takeaway
MLOps engineers managing high-QPS, low-latency AI inference should consider a platform partnership in which the provider co-invests engineering effort to meet strict SLAs. Prioritize infrastructure optimizations such as intelligent load balancing and image acceleration, alongside runtime improvements such as FP8 quantization and multiprocessing, to raise throughput while maintaining reliability for demanding real-time applications.
Key insights
A collaborative engineering approach can achieve high-scale, low-latency AI inference by optimizing infrastructure and runtime.
Principles
- Asymmetric autoscaling prevents latency spikes.
- Per-channel quantization improves FP8 quality.
- CPU bottlenecks can limit fast GPU models.
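One common reading of asymmetric autoscaling is "scale up immediately, scale down slowly". The source does not describe the actual policy, so the sketch below is only an illustration of that idea. The class name, the per-pod target, and the cooldown length are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class AsymmetricAutoscaler:
    """Toy controller: grows at once, shrinks one replica at a time after a cooldown."""

    target_qps_per_pod: float        # hypothetical per-pod target load
    scale_down_cooldown_ticks: int   # hypothetical: consecutive low-load ticks before shrinking
    replicas: int = 1
    _low_ticks: int = 0

    def step(self, observed_qps: float) -> int:
        desired = max(1, math.ceil(observed_qps / self.target_qps_per_pod))
        if desired > self.replicas:
            # Scale up right away so a traffic burst does not queue behind too few pods.
            self.replicas = desired
            self._low_ticks = 0
        elif desired < self.replicas:
            # Scale down only after sustained low load, and only by one replica.
            self._low_ticks += 1
            if self._low_ticks >= self.scale_down_cooldown_ticks:
                self.replicas -= 1
                self._low_ticks = 0
        else:
            self._low_ticks = 0
        return self.replicas


scaler = AsymmetricAutoscaler(target_qps_per_pod=1000, scale_down_cooldown_ticks=3)
history = [scaler.step(qps) for qps in (5000, 1000, 1000, 1000)]
print(history)
```

The asymmetry is deliberate: removing capacity too eagerly means the next burst lands on pods that are still starting, which is where latency spikes tend to come from.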
Method
Modernizing the LLM serving stack involved three pieces: custom load balancing (power-of-two choices), image acceleration for faster container startup, and runtime optimizations such as FP8 quantization and multiprocessing to eliminate CPU bottlenecks.
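Power-of-two choices picks two backends at random and routes to the less loaded one. The sketch below shows the general technique and compares it with purely random routing in a toy simulation. It is not the load balancer described in the source: the pod names, the in-flight counters, and the `simulate` helper are assumptions made for illustration.

```python
import random


def pick_backend(in_flight: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct backends at random; return the one with fewer in-flight requests."""
    a, b = rng.sample(list(in_flight), 2)
    return a if in_flight[a] <= in_flight[b] else b


def simulate(num_backends: int, num_requests: int, two_choices: bool, seed: int = 0) -> int:
    """Assign requests that never complete and return the load on the busiest backend."""
    rng = random.Random(seed)
    load = {f"pod-{i}": 0 for i in range(num_backends)}
    names = list(load)
    for _ in range(num_requests):
        target = pick_backend(load, rng) if two_choices else rng.choice(names)
        load[target] += 1
    return max(load.values())


random_max = simulate(100, 10_000, two_choices=False)
p2c_max = simulate(100, 10_000, two_choices=True)
print(f"busiest pod, random routing: {random_max}")
print(f"busiest pod, power-of-two choices: {p2c_max}")
```

The appeal of the approach is that it needs load information for only two backends per request, yet the busiest backend ends up much closer to the average than under random routing.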
In practice
- Implement power-of-two load balancing for high QPS.
- Use lazy-loading container images to reduce cold starts.
- Explore FP8 quantization for throughput gains.
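For the FP8 item, the principle that per-channel quantization improves quality can be shown with a small simulation. The sketch below is pure Python and makes several assumptions: it targets the E4M3 format (maximum magnitude 448, 3 mantissa bits), it approximates rounding rather than reproducing hardware behaviour bit for bit, and the weight values are invented. The source does not say which tensors were quantized or how scales were chosen.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Approximate rounding to FP8 E4M3; a simulation, not bit-exact hardware behaviour."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)              # saturate instead of overflowing
    exp = max(math.floor(math.log2(mag)), -6)    # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                      # spacing of representable values (3 mantissa bits)
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_dequantize(row: list[float], scale: float) -> list[float]:
    """Scale into FP8 range, round, then scale back."""
    return [round_to_e4m3(v / scale) * scale for v in row]


def mean_abs_error(original: list[float], restored: list[float]) -> float:
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)


# Invented weights: one small-magnitude channel and one large-magnitude channel.
weights = [
    [0.001, -0.002, 0.0015],
    [100.0, -250.0, 75.0],
]

# Per-tensor: a single scale set by the largest value anywhere in the matrix.
tensor_scale = max(abs(v) for row in weights for v in row) / FP8_E4M3_MAX
# Per-channel: one scale per row, set by that row's own largest value.
channel_scales = [max(abs(v) for v in row) / FP8_E4M3_MAX for row in weights]

per_tensor = [quantize_dequantize(row, tensor_scale) for row in weights]
per_channel = [quantize_dequantize(row, s) for row, s in zip(weights, channel_scales)]

err_tensor = mean_abs_error(weights[0], per_tensor[0])
err_channel = mean_abs_error(weights[0], per_channel[0])
print(f"small channel, per-tensor error:  {err_tensor:.2e}")
print(f"small channel, per-channel error: {err_channel:.2e}")
```

With a single per-tensor scale, the large channel dictates the range and the small channel is pushed into the coarsest part of the FP8 grid. Giving each channel its own scale lets the small channel use the full range, which is why its reconstruction error drops.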
Topics
- Databricks Model Serving
- High-QPS Inference
- FP8 Quantization
- Image Acceleration
- Power-of-Two Choices Load Balancing
Best for: MLOps Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.