How the Lakebase architecture stays resilient to cloud failures
Summary
The Databricks Lakebase architecture is designed to stay resilient to cloud failures while serving agentic workloads, which demand high control-plane throughput and on-demand capacity. Postgres compute is stateless and durable data lives in remote, zone-redundant storage, so a failed compute can be replaced instantly without traditional replication or crash recovery. All databases use distributed, zone-redundant object storage fronted by NVMe SSD caches. Because the control plane is now critical for agentic operations, its hot-path operations are being refactored into a dedicated data plane controller with minimal external dependencies. Lakebase also reduces its reliance on cloud provider control planes by managing its own instance pools and virtualization layer.

The system uses "cells" for regional scaling and blast-radius containment; during an AWS incident on May 8, 2026, impact was limited to ~13% of databases. Rigorous failure simulation, including whole-AZ-down tests, targets less than 30 seconds of workload downtime. Availability is measured with SLIs/SLOs against a 99.99% target for every database; reported attainment of that target was 99.85% (Jan 2026) to 99.75% (Apr 2026).
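To put the 99.99% target and the 30-second downtime goal side by side, the arithmetic can be sketched as below. This is only an illustration: the 30-day window and the function names are assumptions, not figures or APIs from the Databricks article.

```python
def downtime_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in seconds, for an availability target over a window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - slo) * window_seconds


def failovers_within_budget(slo: float, failover_seconds: float,
                            window_days: int = 30) -> int:
    """How many outages of a given length fit inside the downtime budget."""
    return int(downtime_budget_seconds(slo, window_days) // failover_seconds)


# Assumed 30-day window: 99.99% leaves roughly 259 seconds of downtime.
budget = downtime_budget_seconds(0.9999)
# Each event is assumed to cost the full 30 seconds named as the test goal.
events = failovers_within_budget(0.9999, failover_seconds=30)
print(round(budget, 1), events)
```

Under these assumptions, a database could absorb only a handful of 30-second interruptions per month and still meet a 99.99% target, which is why the failover time matters as much as the failure rate.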
Key takeaway
For MLOps engineers and AI architects building agentic database workloads, Lakebase's architectural patterns offer a blueprint for robust, scalable infrastructure. Consider adopting similar strategies, such as stateless compute and moving critical control-plane operations onto a dedicated hot path, to mitigate cloud failure risks. These patterns help keep critical agentic applications available even during cloud provider incidents.
Key insights
Lakebase achieves high availability for agentic workloads through stateless compute, zone-redundant storage, and a refactored control plane.
Principles
- High availability must be a core design tenet.
- Control plane operations are critical for agentic workloads.
- Compartmentalize faults to limit blast radius.
Method
Implement stateless Postgres compute with remote, zone-redundant storage. Separate critical control plane functions into a dedicated data plane controller. Employ cell-based regional deployments for fault isolation. Conduct rigorous failure injection and whole-AZ down simulations.
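The fault-isolation effect of cell-based deployments can be sketched with a toy model. Everything here is hypothetical: the hash-based placement, the cell count of 8, and the function names are illustrative assumptions and do not describe how Lakebase actually assigns databases to cells.

```python
import hashlib


def assign_cell(database_id: str, num_cells: int) -> int:
    """Map a database to a cell with a stable hash (hypothetical placement rule)."""
    digest = hashlib.sha256(database_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def impacted_fraction(database_ids: list[str], num_cells: int,
                      failed_cell: int) -> float:
    """Fraction of databases that live in the failed cell."""
    impacted = sum(
        1 for db in database_ids if assign_cell(db, num_cells) == failed_cell
    )
    return impacted / len(database_ids)


# Synthetic fleet; with 8 cells, one failed cell holds about 1/8 of databases.
dbs = [f"db-{i}" for i in range(10_000)]
fraction = impacted_fraction(dbs, num_cells=8, failed_cell=3)
print(f"{fraction:.1%}")
```

The point of the sketch is the ratio: if a failure is contained to one of N evenly loaded cells, roughly 1/N of databases are affected and the rest keep running.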
In practice
- Configure dedicated computes across multiple AZs for HA.
- Implement failpoints and chaos testing in releases.
- Measure individual database availability via SLIs/SLOs.
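Measuring availability per database, rather than as a fleet-wide average, can be sketched as follows. This is a minimal illustration under stated assumptions: the probe-based SLI, the `ProbeCounts` type, and the reading of "attainment" as the share of databases meeting the target are all hypothetical and not taken from the original article.

```python
from dataclasses import dataclass


@dataclass
class ProbeCounts:
    """Health-probe results for one database over a window (hypothetical SLI)."""
    succeeded: int
    total: int


def availability(counts: ProbeCounts) -> float:
    """SLI: fraction of probes that succeeded."""
    if counts.total <= 0:
        raise ValueError("no probes recorded for this window")
    return counts.succeeded / counts.total


def slo_attainment(fleet: dict[str, ProbeCounts], target: float = 0.9999) -> float:
    """Share of databases whose individual availability meets the target."""
    met = sum(1 for counts in fleet.values() if availability(counts) >= target)
    return met / len(fleet)


# One probe per minute for 30 days gives 43,200 probes per database.
fleet = {
    "db-a": ProbeCounts(succeeded=43_200, total=43_200),  # no failures
    "db-b": ProbeCounts(succeeded=43_197, total=43_200),  # 3 failed probes
    "db-c": ProbeCounts(succeeded=43_100, total=43_200),  # 100 failed probes
}
attainment = slo_attainment(fleet)
print(f"{attainment:.2%}")
```

A fleet-wide average would hide `db-c` behind the two healthy databases; counting each database against the target individually makes the miss visible.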
Topics
- Lakebase Architecture
- Agentic Workloads
- Cloud Resilience
- High Availability
- Chaos Engineering
- Service Level Objectives
Best for: AI Architect, MLOps Engineer, Data Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.