How Superhuman and Databricks built a 200K QPS inference platform together

2026-05-08 · Source: Databricks · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Superhuman, a productivity platform serving over 40 million daily users, partnered with Databricks to modernize the AI model serving stack behind its real-time communication assistance. Superhuman's custom large language model handles peak traffic exceeding 200,000 queries per second (QPS) with P99 latency under 1 second and 99.99% reliability. Migrating from a DIY vLLM-based stack to Databricks Model Serving addressed compounding pain points, including manual performance tuning for each new model iteration and a growing operational burden. The collaboration focused on three areas: load balancing with a power-of-two choices algorithm, faster container startup via lazy-loading image formats, and runtime optimizations such as FP8 quantization and multiprocessing. The runtime optimizations raised per-pod throughput on H100 GPUs by 60%, from 750 QPS to 1,200 QPS.
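As a back-of-the-envelope check on the figures in the summary, the sketch below recomputes the 60% gain and estimates how many pods are needed to cover peak traffic before and after the optimization. The summary does not state the actual fleet size or headroom policy, so `pods_needed` is a hypothetical helper that ignores headroom, redundancy, and uneven load.

```python
import math

PEAK_QPS = 200_000   # peak traffic cited in the summary
QPS_BEFORE = 750     # per-pod throughput before the runtime optimizations
QPS_AFTER = 1_200    # per-pod throughput after the runtime optimizations


def pods_needed(peak_qps: int, per_pod_qps: int) -> int:
    """Minimum pod count to cover peak traffic (assumption: no headroom)."""
    return math.ceil(peak_qps / per_pod_qps)


# Relative throughput gain per pod.
gain = (QPS_AFTER - QPS_BEFORE) / QPS_BEFORE

pods_before = pods_needed(PEAK_QPS, QPS_BEFORE)
pods_after = pods_needed(PEAK_QPS, QPS_AFTER)

print(f"per-pod gain: {gain:.0%}")
print(f"pods at peak: {pods_before} before, {pods_after} after")
```

Under these simplifying assumptions, the same peak load needs roughly 100 fewer pods, which is why per-pod throughput matters so much at this scale.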

Key takeaway

MLOps engineers running high-QPS, low-latency AI inference should consider a platform partnership in which the vendor co-invests engineering effort to meet strict SLAs. Prioritize infrastructure optimizations such as intelligent load balancing and image acceleration, alongside runtime improvements such as FP8 quantization and multiprocessing. In this case the runtime work alone lifted per-pod throughput by 60% while the system held its reliability target for a demanding real-time application.

Key insights

A collaborative engineering approach can deliver high-scale, low-latency AI inference by optimizing both the serving infrastructure and the model runtime.

Principles

Asymmetric autoscaling prevents latency spikes.
Per-channel quantization improves FP8 quality.
CPU bottlenecks can limit fast GPU models.
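The first principle above can be sketched in a few lines. The summary does not define "asymmetric", so this assumes the common reading: scale up immediately when demand rises, and scale down only gradually so a brief dip in traffic does not strip capacity just before the next spike. The function name, its parameters, and the one-pod-per-step scale-down limit are all hypothetical, not taken from the Databricks article.

```python
import math


def desired_replicas(
    current: int,
    observed_qps: float,
    target_qps_per_pod: float,
    max_step_down: int = 1,
) -> int:
    """Return the replica count for the next scaling decision.

    Assumption: scale-up is unbounded and immediate, scale-down is
    limited to `max_step_down` pods per decision.
    """
    needed = max(1, math.ceil(observed_qps / target_qps_per_pod))
    if needed >= current:
        # Demand grew (or is unchanged): jump straight to the needed count.
        return needed
    # Demand fell: release capacity slowly to absorb a rebound.
    return max(needed, current - max_step_down)


# Traffic doubles: capacity follows in a single step.
print(desired_replicas(current=10, observed_qps=24_000, target_qps_per_pod=1_200))
# Traffic drops sharply: only one pod is released per decision.
print(desired_replicas(current=20, observed_qps=6_000, target_qps_per_pod=1_200))
```

The asymmetry trades some idle capacity for protection against latency spikes, which is usually the right trade when the SLA is a sub-second P99.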

Method

The modernized LLM serving stack combined three changes: custom load balancing based on power-of-two choices, image acceleration for faster container startup, and runtime optimizations, namely FP8 quantization for throughput and multiprocessing to remove CPU bottlenecks.
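The summary names power-of-two choices but does not show how it was implemented. The sketch below is the textbook version of the technique, not Databricks' code: for each request, sample two replicas at random and send the request to the one with fewer in-flight requests. Using in-flight count as the load signal is an assumption, and the simulation never completes requests, so it only compares how evenly each policy spreads work.

```python
import random


def pick_replica(in_flight: list, rng: random.Random) -> int:
    """Power-of-two choices: sample two distinct replicas, keep the less loaded."""
    a, b = rng.sample(range(len(in_flight)), 2)
    return a if in_flight[a] <= in_flight[b] else b


def simulate(n_replicas: int, n_requests: int, policy: str, seed: int = 0) -> int:
    """Assign requests to replicas and return the load on the busiest replica."""
    rng = random.Random(seed)
    in_flight = [0] * n_replicas
    for _ in range(n_requests):
        if policy == "p2c":
            idx = pick_replica(in_flight, rng)
        else:
            # Baseline: a single uniformly random choice.
            idx = rng.randrange(n_replicas)
        in_flight[idx] += 1
    return max(in_flight)


max_random = simulate(n_replicas=100, n_requests=10_000, policy="random")
max_p2c = simulate(n_replicas=100, n_requests=10_000, policy="p2c")
print(f"busiest replica, random choice:       {max_random}")
print(f"busiest replica, power-of-two choices: {max_p2c}")
```

The appeal at high QPS is that each routing decision inspects only two replicas, so the balancer avoids scanning the whole fleet while still keeping the busiest replica close to the average load.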

In practice

Implement power-of-two load balancing for high QPS.
Use lazy-loading container images to reduce cold starts.
Explore FP8 quantization for throughput gains.
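To see why the per-channel principle listed earlier matters when exploring FP8, the sketch below compares one scale for a whole weight tensor against one scale per output channel. It uses a simplified pure-Python emulation of the FP8 E4M3 grid (3 mantissa bits, smallest normal exponent -6, saturation at 448, no NaN handling); real deployments rely on GPU kernels and library support, and the weight values here are invented for illustration.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)      # saturate instead of overflowing
    _, exp = math.frexp(a)         # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)           # binade exponent; -6 is the smallest normal
    step = 2.0 ** (e - 3)          # 3 mantissa bits give 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize(values, scale):
    """Scale into FP8 range, round, and scale back (quantize then dequantize)."""
    return [round_e4m3(v / scale) * scale for v in values]


def mean_rel_error(original, restored):
    """Average relative error between original and restored weights."""
    return sum(abs(o - r) / abs(o) for o, r in zip(original, restored)) / len(original)


# Hypothetical weights: one row per output channel, very different magnitudes.
weights = [
    [100.0, -50.0, 25.0],
    [0.001, -0.002, 0.0015],
]

# Per-tensor: a single scale derived from the largest weight anywhere.
tensor_scale = max(abs(v) for row in weights for v in row) / E4M3_MAX
per_tensor = [quantize(row, tensor_scale) for row in weights]

# Per-channel: each row gets a scale derived from its own largest weight.
per_channel = [quantize(row, max(abs(v) for v in row) / E4M3_MAX) for row in weights]

small_err_tensor = mean_rel_error(weights[1], per_tensor[1])
small_err_channel = mean_rel_error(weights[1], per_channel[1])
print(f"small channel, per-tensor error:  {small_err_tensor:.3f}")
print(f"small channel, per-channel error: {small_err_channel:.3f}")
```

With a single tensor-wide scale, the small-magnitude channel is pushed into the coarse subnormal end of the FP8 range and loses precision; giving each channel its own scale lets it use the full range.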

Topics

Databricks Model Serving
High-QPS Inference
FP8 Quantization
Image Acceleration
Power-of-Two Choices Load Balancing

Best for: MLOps Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.