Reliability fail: No automated zone failover for Coinbase’s global trading service
Summary
On May 7, 2026, Coinbase experienced a nearly 10-hour global trading outage that coincided with a regional AWS disruption. The \$40B company, which processes \$5.2 trillion annually, confirmed that its core matching engine was intentionally confined to a single AWS Availability Zone (AZ) to meet the low-latency demands of its Raft-based replicated cluster. However, Coinbase had no automated cross-zone failover. The incident, which interrupted approximately \$7 billion in financial activity, required an emergency code change and manual intervention to restore service. It follows a similar 3-hour global trading disruption in October 2025, caused by AWS DynamoDB issues, after which Coinbase had committed to reviewing its regional deployment strategy. The author argues that Coinbase's infrastructure resilience falls short of both its scale and its past commitments.
Key takeaway
If you oversee a high-value financial platform, prioritize automated cross-zone failover, even when core services demand single-AZ co-location for latency. Your infrastructure strategy should explicitly address the risk of AZ outages with pre-planned, tested recovery mechanisms. Relying on manual intervention or emergency code changes during an incident is unacceptable for systems handling trillions of dollars. Run regular failover drills to validate resilience, and make sure post-outage commitments translate into tangible architectural improvements.
Key insights
Critical financial services require automated cross-zone failover despite low-latency single-AZ design choices.
Principles
- Single-AZ dependencies pose significant outage risks.
- Automated failover is crucial for high-availability systems.
- Post-mortem commitments require concrete implementation.
Method
Recovery involved three steps: an emergency code change to remove a startup assumption, the creation of a new node group, and a carefully sequenced restoration of quorum.
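To illustrate why quorum had to be restored rather than simply waited for, here is a minimal sketch of Raft majority arithmetic. The node names, zone labels, and three-node cluster size are hypothetical and do not describe Coinbase's actual topology; the only fact taken from the article is that the Raft cluster lived in a single AZ.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft cluster with `voters` voting members."""
    return voters // 2 + 1


def has_quorum_after_loss(placement: dict[str, str], lost_zone: str) -> bool:
    """True if the nodes outside `lost_zone` still form a majority."""
    survivors = sum(1 for zone in placement.values() if zone != lost_zone)
    return survivors >= quorum_size(len(placement))


# Hypothetical placements: node name -> availability zone.
single_az = {"node-1": "az-a", "node-2": "az-a", "node-3": "az-a"}
spread = {"node-1": "az-a", "node-2": "az-b", "node-3": "az-c"}

print(has_quorum_after_loss(single_az, "az-a"))  # False: no survivors
print(has_quorum_after_loss(spread, "az-a"))     # True: 2 of 3 remain
```

Spreading voters across zones is not free: every commit then waits on cross-zone round trips, which is the latency cost the single-AZ design was chosen to avoid. The sketch only shows why a single-AZ cluster cannot recover on its own once its zone is lost and needs a new node group elsewhere.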
In practice
- Evaluate critical systems for single-AZ dependencies.
- Implement automated cross-zone failover for core services.
- Conduct regular failover drills for disaster preparedness.
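As a starting point for the first item, the audit could be sketched as follows. The inventory format and the service names are assumptions made for illustration; in practice the data would come from whatever deployment inventory you already maintain.

```python
def single_az_services(inventory: dict[str, list[str]]) -> list[str]:
    """Return services whose instances all sit in one zone, sorted by name."""
    return sorted(
        name for name, zones in inventory.items() if len(set(zones)) == 1
    )


# Hypothetical inventory: service name -> zone of each running instance.
inventory = {
    "matching-engine": ["az-a", "az-a", "az-a"],
    "api-gateway": ["az-a", "az-b"],
    "ledger": ["az-b"],
}

print(single_az_services(inventory))  # ['ledger', 'matching-engine']
```

A flagged service is not automatically a defect, since co-location can be a deliberate latency choice, but each one should have a documented and rehearsed failover path.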
Topics
- AWS Outage
- Availability Zones
- Automated Failover
- Site Reliability Engineering
- Financial Trading Systems
- Coinbase
Best for: CTO, DevOps Engineer, Software Engineer, VP of Engineering/Data
Editorial summary, takeaway, and curation by AIssential. Original article published by The Pragmatic Engineer.