High Performance Rate Limiting at Databricks
Summary
Databricks redesigned its rate limiting service for real-time AI workloads, replacing a Redis-backed architecture with a sharded in-memory system that uses optimistic batch reporting. The original setup, an Envoy ingress gateway backed by a single Redis instance, showed high tail latency (10-20 ms P99), limited scalability, and a single point of failure once real-time model serving increased traffic. The new architecture uses Dicer, a routing layer, to keep counters in memory on horizontally scalable, fault-tolerant shards, which removes the network hop to Redis. Databricks also adopted optimistic rate limiting: clients make local decisions and asynchronously report counts every 100 milliseconds, which lowers client-side latency and stabilizes server load. The design accepts a controlled overshoot of approximately 5% of the rate limit policy. That overshoot is bounded by a rejection rate formula, client-side local limiting, and a token bucket algorithm, which became feasible once counters moved into memory.
Key takeaway
AI architects and MLOps engineers designing high-throughput, low-latency services should consider optimistic, asynchronous rate limiting with in-memory sharded counters. As applied at Databricks, this approach cuts tail latency and improves scalability by trading strict real-time accuracy for performance, provided your backend systems can tolerate a small percentage of overshoot.
Key insights
Optimistic rate limiting with in-memory sharded counters and asynchronous reporting significantly improves scalability and latency.
Principles
- Strict accuracy is expensive at scale.
- In-memory state enables advanced algorithms.
- Distributed counting has three coupled aspects.
Method
Databricks implemented batch-reporting with optimistic rate limiting: clients make local decisions, asynchronously report counts, and receive rejection instructions, reducing synchronous calls and stabilizing server load.
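The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Databricks' implementation: the class names `RateLimitServer` and `OptimisticClient` are hypothetical, the rejection rate formula (the fraction of estimated traffic above the limit) is an assumed form because this summary does not give the actual one, and `flush` is called by hand where a real client would use a background timer. The sketch also handles a single client, whereas a real server would aggregate counts across clients.

```python
import random
from collections import Counter


class RateLimitServer:
    """Hypothetical server that turns reported counts into rejection rates."""

    def __init__(self, limit_qps, report_interval_s=0.1):
        self.limit_qps = limit_qps
        self.report_interval_s = report_interval_s

    def report(self, counts):
        """Accept one batch of per-key counts and return a rejection rate per key."""
        rates = {}
        for key, n in counts.items():
            estimated_qps = n / self.report_interval_s
            # Assumed formula: reject the share of traffic that exceeds the limit.
            excess = max(0.0, estimated_qps - self.limit_qps)
            rates[key] = excess / estimated_qps if estimated_qps else 0.0
        return rates


class OptimisticClient:
    """Hypothetical client that decides locally and reports asynchronously."""

    def __init__(self, server, rng=random.random):
        self.server = server
        self.rng = rng                # injectable so the behavior can be tested
        self.pending = Counter()      # counts not yet reported
        self.rejection_rates = {}     # latest instructions from the server

    def allow(self, key):
        """Local decision: no network call on the request path."""
        # Assumption: every request is counted, including ones rejected here,
        # so the server sees the full incoming rate.
        self.pending[key] += 1
        return self.rng() >= self.rejection_rates.get(key, 0.0)

    def flush(self):
        """Report the batch; a background timer would call this every 100 ms."""
        batch, self.pending = self.pending, Counter()
        self.rejection_rates = self.server.report(batch)
```

With a 100 QPS limit and a 100 ms interval, a client that sends 20 requests in one interval is allowed all 20 (the optimistic overshoot), then receives a rejection rate of about 0.5 for the next interval. This is how the design trades a bounded overshoot for a request path with no synchronous call.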
In practice
- Consider in-memory sharded storage for high-QPS counters.
- Evaluate accuracy tradeoffs for performance gains.
- Use a token bucket algorithm for stricter rate limiting behavior.
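As a rough illustration of the first and third points, the sketch below keeps one token bucket per key in sharded in-memory dictionaries. It rests on several assumptions: `TokenBucket` and `ShardedLimiter` are hypothetical names, the hash-modulo routing is a simple stand-in rather than Dicer's actual mechanism, and the shards live in a single process, whereas a real deployment would place them on separate servers.

```python
import hashlib
import time


class TokenBucket:
    """Token bucket that refills at rate_per_s up to a maximum of burst tokens."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock            # injectable clock, useful for tests
        self.tokens = float(burst)    # start with a full bucket
        self.last = clock()

    def try_acquire(self, n=1):
        """Refill for the elapsed time, then take n tokens if available."""
        now = self.clock()
        refill = (now - self.last) * self.rate_per_s
        self.tokens = min(self.burst, self.tokens + refill)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


class ShardedLimiter:
    """Routes each key to one in-memory shard that owns its bucket."""

    def __init__(self, num_shards, rate_per_s, burst, clock=time.monotonic):
        self.shards = [{} for _ in range(num_shards)]
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock

    def shard_for(self, key):
        """Stable hash so the same key always lands on the same shard."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def try_acquire(self, key):
        shard = self.shards[self.shard_for(key)]
        bucket = shard.get(key)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_s, self.burst, self.clock)
            shard[key] = bucket
        return bucket.try_acquire()
```

Because each key has exactly one owning shard, its bucket can be updated in place without cross-node coordination. The second point in the list still applies: the burst size sets how much short-term excess a key is allowed before requests are rejected.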
Topics
- Rate Limiting Architecture
- Databricks Engineering
- Real-time Model Serving
- Distributed Counting
- Optimistic Rate Limiting
Best for: AI Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.