High Performance Rate Limiting at Databricks

· Source: ByteByteGo Newsletter · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure, Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

Databricks redesigned its rate limiting service to handle real-time AI workloads, moving from a Redis-backed architecture to a sharded in-memory system with optimistic batch reporting. The original setup paired an Envoy ingress gateway with a single Redis instance. As traffic from real-time model serving grew, it suffered from high tail latency (P99 of 10-20 ms) and limited scalability, and Redis was a single point of failure. The new architecture uses Dicer, a routing layer, to provide horizontally scalable, fault-tolerant in-memory counters, which removes the network hop to Redis. Databricks also implemented optimistic rate limiting: clients make decisions locally and report their counts asynchronously every 100 milliseconds, which significantly reduces client-side latency and stabilizes server load. The design accepts a controlled overshoot of approximately 5% of the rate limit policy. Three mechanisms bound that overshoot: a rejection rate formula, client-side local limiting, and a token bucket algorithm, which became feasible once counters lived in memory.
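The token bucket mentioned above can be illustrated with a minimal in-memory sketch. This is not Databricks' implementation: the class name, parameters, and the injectable `now` argument are assumptions made for illustration, and sharding via Dicer is not modeled.

```python
import time


class TokenBucket:
    """Hypothetical in-memory token bucket.

    Refills at `rate` tokens per second, up to `capacity` tokens.
    """

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # the bucket starts full
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        """Return True and spend `cost` tokens if enough are available."""
        now = time.monotonic() if now is None else now
        # Refill lazily, based on the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Because the state is two floats updated in place, each decision is a cheap local operation, which is plausibly why this algorithm is easier to run against in-memory counters than against a remote store.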

Key takeaway

AI architects and MLOps engineers designing high-throughput, low-latency services should consider an optimistic, asynchronous rate limiting approach built on in-memory sharded counters. As the Databricks redesign shows, this strategy cuts tail latency and improves scalability by trading strict real-time accuracy for performance, provided the backend systems can tolerate a small percentage of overshoot.

Key insights

Optimistic rate limiting with in-memory sharded counters and asynchronous reporting significantly improves scalability and latency.

Principles

Method

Databricks implemented batch-reporting with optimistic rate limiting: clients make local decisions, asynchronously report counts, and receive rejection instructions, reducing synchronous calls and stabilizing server load.
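The flow described above can be sketched roughly as follows. All class and method names are hypothetical, and the rejection formula, (estimated QPS - limit) / estimated QPS, is an assumed form chosen for illustration; the summary only states that a rejection rate formula exists. The background timer that would call `flush` every 100 milliseconds is left out.

```python
import random
from collections import Counter


class RateLimitServer:
    """Hypothetical in-memory shard: turns reported counts into rejection rates."""

    def __init__(self, limits_qps):
        self.limits_qps = limits_qps  # key -> allowed requests per second

    def report(self, counts, interval_s):
        rates = {}
        for key, count in counts.items():
            estimated_qps = count / interval_s
            limit = self.limits_qps[key]
            # Assumed formula: reject the fraction of traffic above the limit.
            rates[key] = max(0.0, (estimated_qps - limit) / estimated_qps)
        return rates


class OptimisticClient:
    """Decides locally; talks to the server only when flush() is called."""

    def __init__(self, server, interval_s=0.1, rng=random.random):
        self.server = server
        self.interval_s = interval_s
        self.rng = rng  # injectable so the behaviour can be tested
        self.counts = Counter()
        self.rejection_rates = {}

    def allow(self, key):
        self.counts[key] += 1  # count every attempt, allowed or not
        # Local decision: reject with the probability last sent by the server.
        return self.rng() >= self.rejection_rates.get(key, 0.0)

    def flush(self):
        """Report the batch and pick up new rejection instructions."""
        counts, self.counts = self.counts, Counter()
        # Keys absent from this batch fall back to a rejection rate of 0.
        self.rejection_rates = self.server.report(counts, self.interval_s)
```

In this sketch a key that has never been reported is always allowed, so the first interval can exceed the limit. That is the kind of overshoot the optimistic design accepts in exchange for taking the server off the request path.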

In practice

Topics

Best for: AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.