High Performance Rate Limiting at Databricks

2025-12-15 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure, Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

Databricks redesigned its rate limiting service to handle real-time AI workloads, moving from a Redis-backed architecture to a sharded in-memory system with optimistic batch reporting. The original setup, built on an Envoy ingress gateway and a single Redis instance, suffered from high tail latency (10-20 ms P99), limited scalability, and a single point of failure as traffic from real-time model serving grew. The new architecture uses Dicer, a routing layer, to provide horizontally scalable, fault-tolerant in-memory counters and removes the network hop to Redis. Databricks also implemented optimistic rate limiting: clients make local decisions and asynchronously report counts every 100 milliseconds, which reduces client-side latency and stabilizes server load. The design accepts a controlled overshoot of approximately 5% of the rate limit policy. That overshoot is bounded by a rejection rate formula, client-side local limiting, and a token bucket algorithm, whose adoption became feasible once the counters were held in memory.
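The idea of sharded in-memory counters can be illustrated with a minimal sketch. It assumes a fixed number of shards and a static hash of the key; this is a stand-in for illustration only and not how Dicer assigns keys. The class name `ShardedCounters` and its methods are hypothetical.

```python
import hashlib
from collections import defaultdict


class ShardedCounters:
    """Hypothetical in-memory counters split across a fixed number of shards."""

    def __init__(self, num_shards: int) -> None:
        if num_shards <= 0:
            raise ValueError("num_shards must be positive")
        # Each dict stands in for the memory of one rate limit server.
        self.shards = [defaultdict(int) for _ in range(num_shards)]

    def shard_for(self, key: str) -> int:
        # Stable hash, so every caller routes a given key to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def add(self, key: str, count: int = 1) -> int:
        # Increment the key's counter on its owning shard and return the total.
        shard = self.shards[self.shard_for(key)]
        shard[key] += count
        return shard[key]
```

Because each key lives on exactly one shard, an increment is a local memory update on that shard rather than a round trip to a shared store, and adding shards spreads the keys over more servers.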

Key takeaway

For AI Architects and MLOps Engineers designing high-throughput, low-latency services, consider adopting an optimistic, asynchronous rate limiting approach with in-memory sharded counters. This strategy, exemplified by Databricks, allows for significant reductions in tail latency and improved scalability by trading strict real-time accuracy for performance, provided your backend systems can tolerate a small percentage of overshoot.

Key insights

Optimistic rate limiting with in-memory sharded counters and asynchronous reporting significantly improves scalability and latency.

Principles

Strict accuracy is expensive at scale.
In-memory state enables advanced algorithms.
Distributed counting has three coupled aspects.

Method

Databricks implemented batch-reporting with optimistic rate limiting: clients make local decisions, asynchronously report counts, and receive rejection instructions, reducing synchronous calls and stabilizing server load.
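A minimal sketch of this flow follows. It makes several assumptions that are not in the source: the server is an in-process fake, the rejection instruction is simplified to a set of keys to reject outright (the source mentions a rejection rate formula, which is not reproduced here), and `flush()` is called by hand where a real client would run it on a timer. All class and method names are hypothetical.

```python
from collections import Counter


class FakeRateLimitServer:
    """Stand-in for the remote service: total count per key in one window."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.totals = Counter()

    def report(self, batch: dict) -> set:
        # Merge the batch, then return the keys now at or over the limit.
        self.totals.update(batch)
        return {key for key, total in self.totals.items() if total >= self.limit}


class OptimisticClient:
    """Admits requests locally and reports counts in batches."""

    def __init__(self, server: FakeRateLimitServer) -> None:
        self.server = server
        self.pending = Counter()  # admitted requests not yet reported
        self.rejecting = set()    # keys the server told us to reject

    def allow(self, key: str) -> bool:
        # Local decision only: no call to the server on the request path.
        if key in self.rejecting:
            return False
        self.pending[key] += 1
        return True

    def flush(self) -> None:
        # A real client would run this periodically (the source cites 100 ms).
        batch = dict(self.pending)
        self.pending.clear()
        self.rejecting = self.server.report(batch)
```

With a limit of 5, a client that admits 8 requests before its first flush has overshot by 3; only after the flush does it start rejecting that key. This is the accuracy tradeoff the design accepts: the overshoot is limited by how much traffic can arrive between reports.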

In practice

Consider in-memory sharded storage for high-QPS counters.
Evaluate accuracy tradeoffs for performance gains.
Use token bucket for stricter rate limiting behavior.
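For the token bucket item above, a minimal single-process sketch is shown below. It is a generic textbook token bucket, not Databricks' implementation; the `rate`, `capacity`, and `clock` parameters are illustrative choices, and the injectable clock exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        # Spend tokens if enough are available; otherwise reject.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Each decision reads and writes the token count and the last refill time together, which is cheap when the state sits in local memory and more awkward when every update is a round trip to an external store.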

Topics

Rate Limiting Architecture
Databricks Engineering
Real-time Model Serving
Distributed Counting
Optimistic Rate Limiting

Best for: AI Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.