Reliability fail: No automated zone failover for Coinbase’s global trading service

2026-06-23 · Source: The Pragmatic Engineer · Field: Technology & Digital — Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, short

Summary

On May 7, 2026, Coinbase experienced a nearly 10-hour global trading outage that coincided with a regional AWS disruption. The $40B company, which processes $5.2 trillion annually, confirmed that its core matching engine was intentionally confined to a single AWS Availability Zone (AZ) to meet the low-latency demands of its Raft-based replicated cluster. However, Coinbase had no automated cross-zone failover. The incident interrupted approximately $7 billion in financial activity and required an emergency code change and manual intervention to restore service. It follows a similar 3-hour global trading disruption in October 2025, caused by AWS DynamoDB issues, after which Coinbase had committed to reviewing its regional deployment strategy. The author argues that Coinbase's infrastructure resilience falls short of what its scale and past commitments would suggest.

Key takeaway

If you lead engineering for a high-value financial platform, prioritize automated cross-zone failover, even when core services demand single-AZ co-location for latency. Your infrastructure strategy should explicitly address the risk of AZ outages with pre-planned, tested recovery mechanisms. Relying on manual intervention or emergency code changes during an incident is unacceptable for systems handling trillions of dollars. Run regular failover drills to validate resilience, and make sure post-outage commitments translate into concrete architectural improvements.

Key insights

Critical financial services require automated cross-zone failover despite low-latency single-AZ design choices.

Principles

Single-AZ dependencies pose significant outage risks.
Automated failover is crucial for high-availability systems.
Post-mortem commitments require concrete implementation.

Method

Recovery required three steps: an emergency code change to remove a startup assumption, the creation of a new node group, and a carefully sequenced restoration of quorum.
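To illustrate why restoring quorum has to be sequenced carefully, here is a minimal sketch of the majority rule that Raft-style clusters rely on. It is not Coinbase's code; the function names are hypothetical and the cluster sizes are chosen only for illustration.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a cluster with `voters` voting members."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def has_quorum(healthy: int, voters: int) -> bool:
    """True if enough voters are healthy to elect a leader and commit writes."""
    return healthy >= quorum_size(voters)


# Illustrative 5-voter cluster (the real cluster size is not given in the source).
print(quorum_size(5))     # 3
print(has_quorum(3, 5))   # True: a majority is still reachable
print(has_quorum(0, 5))   # False: every voter was in the failed zone
```

When every voter sits in one zone, a zone failure takes the healthy count to zero, so the cluster cannot make progress until enough members are brought back, or replaced, to form a majority again.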

In practice

Evaluate critical systems for single-AZ dependencies.
Implement automated cross-zone failover for core services.
Conduct regular failover drills for disaster preparedness.
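The first step above, evaluating single-AZ dependencies, can be sketched as a small placement check. This is an illustrative sketch only: the zone names and the helper `zones_that_break_quorum` are hypothetical, and it assumes a majority-quorum cluster. Spreading voters across zones costs latency, which is the trade-off the source says Coinbase made deliberately, so a check like this flags the risk rather than prescribing a layout.

```python
from collections import Counter


def zones_that_break_quorum(replica_zones: list[str]) -> list[str]:
    """Return the zones whose loss would leave fewer than a majority of replicas."""
    total = len(replica_zones)
    majority = total // 2 + 1
    per_zone = Counter(replica_zones)
    return sorted(
        zone for zone, count in per_zone.items() if total - count < majority
    )


# Hypothetical placements: one zone name per replica.
single_az = ["az-1"] * 5
spread = ["az-1", "az-1", "az-2", "az-2", "az-3"]

print(zones_that_break_quorum(single_az))  # ['az-1']
print(zones_that_break_quorum(spread))     # []
```

A non-empty result means a single zone outage is enough to stop the service, which is the case that calls for an automated, regularly drilled failover path.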

Topics

AWS Outage
Availability Zones
Automated Failover
Site Reliability Engineering
Financial Trading Systems
Coinbase

Best for: CTO, DevOps Engineer, Software Engineer, VP of Engineering/Data

Editorial summary, takeaway, and curation by AIssential. Original article published by The Pragmatic Engineer.