Building a fault-tolerant metrics storage system at Airbnb

2026-04-21 · Source: The Airbnb Tech Blog - Medium · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

Airbnb developed a fault-tolerant metrics storage system that handles 1.3 billion active time series, ingests 50 million samples per second, and stores 2.5 petabytes of logical data. In moving off a hosted provider, the team had to persist and serve this data for approximately 1,000 services. Their solution is a multi-tenant architecture that uses shuffle sharding to isolate workloads, plus a consolidated control plane for automated tenant onboarding and configuration. The system was designed to support 10,000 dashboards and 500,000 alerts with p99 query execution under 30 seconds. Initial efforts focused on stabilizing single clusters by adding write and read guardrails, including limits on fetched series and chunks per query, and by sharding compaction workloads. The team then adopted a multi-cluster environment for blast radius reduction and flexibility, using Promxy for cross-cluster querying even though federated queries are 5-10x costlier.

Key takeaway

DevOps engineers designing large-scale observability platforms should treat robust isolation and automated management as core requirements of any multi-tenant metrics system. Implement shuffle sharding and tenant-level guardrails so that one tenant's problems cannot impact the entire system. Plan for a multi-cluster architecture early to reduce blast radius, but budget for federated queries, which can be 5-10x more resource-intensive and require careful optimization. Automate stateful app deployments to maintain consistency across clusters.

Key insights

Scaling observability requires robust multi-tenant isolation, automated management, and strategic multi-cluster deployment to ensure reliability.

Principles

Isolate tenant workloads with shuffle sharding.
Automate tenant onboarding and configuration.
Implement tenant-level write and read guardrails.
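
To make the first principle concrete, here is a minimal sketch of shuffle sharding: each tenant is deterministically assigned a small pseudo-random subset of nodes, so a misbehaving tenant can only affect the nodes in its own shard. The function name, node names, and shard size are illustrative assumptions, not details from Airbnb's system.

```python
import hashlib
import random


def shuffle_shard(tenant_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Return a deterministic pseudo-random subset of nodes for one tenant.

    Hypothetical helper for illustration only.
    """
    if not 0 < shard_size <= len(nodes):
        raise ValueError("shard_size must be between 1 and len(nodes)")
    # Derive a stable seed from the tenant ID so the assignment does not
    # change between processes or restarts.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort the input first so the result does not depend on node ordering.
    return sorted(rng.sample(sorted(nodes), shard_size))


if __name__ == "__main__":
    nodes = [f"ingester-{i}" for i in range(12)]  # assumed node names
    for tenant in ("checkout", "search", "payments"):  # assumed tenant names
        print(tenant, shuffle_shard(tenant, nodes, shard_size=3))
```

With 12 nodes and a shard size of 3 there are 220 possible shards, so two tenants rarely share all of their nodes. A real deployment would also need to limit reshuffling when nodes are added or removed, which this sketch does not attempt.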

Method

Airbnb built a metrics storage system by first stabilizing single clusters with guardrails and autoscaling, then adopting a multi-cluster architecture with Promxy for federated querying and automated deployments.
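
The read guardrails mentioned above (limits on fetched series and chunks per query) can be sketched as a per-query budget that rejects a query once it exceeds its tenant's limits. This is a simplified illustration; the class names and limit values are assumptions and do not reflect Airbnb's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadLimits:
    """Per-tenant read limits (hypothetical field names)."""
    max_series_per_query: int
    max_chunks_per_query: int


class QueryLimitExceeded(Exception):
    """Raised when a query fetches more data than its tenant allows."""


class QueryBudget:
    """Tracks what a single query has fetched against its tenant's limits."""

    def __init__(self, tenant_id: str, limits: ReadLimits) -> None:
        self.tenant_id = tenant_id
        self.limits = limits
        self.series = 0
        self.chunks = 0

    def charge(self, series: int, chunks: int) -> None:
        """Record fetched series and chunks; fail fast once over a limit."""
        self.series += series
        self.chunks += chunks
        if self.series > self.limits.max_series_per_query:
            raise QueryLimitExceeded(
                f"{self.tenant_id}: fetched {self.series} series, "
                f"limit is {self.limits.max_series_per_query}"
            )
        if self.chunks > self.limits.max_chunks_per_query:
            raise QueryLimitExceeded(
                f"{self.tenant_id}: fetched {self.chunks} chunks, "
                f"limit is {self.limits.max_chunks_per_query}"
            )


if __name__ == "__main__":
    # Assumed limit values, chosen only to make the example small.
    budget = QueryBudget("search", ReadLimits(1000, 5000))
    budget.charge(series=400, chunks=2000)  # within limits
    try:
        budget.charge(series=700, chunks=100)  # pushes series past 1000
    except QueryLimitExceeded as err:
        print("rejected:", err)
```

Failing fast in this way keeps one expensive query from exhausting resources shared with other tenants, at the cost of rejecting some legitimate heavy queries.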

In practice

Use shuffle sharding for tenant isolation.
Deploy stateful apps with Kubernetes rollout operators.
Use Promxy for cross-cluster querying.

Topics

Metrics Storage
Observability Platforms
Multi-tenant Architecture
Shuffle Sharding
Fault Tolerance
Promxy
Kubernetes Operators

Best for: MLOps Engineer, DevOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by The Airbnb Tech Blog - Medium.