Reliability fail: No automated zone failover for Coinbase’s global trading service

· Source: The Pragmatic Engineer · Field: Technology & Digital — Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, short

Summary

On May 7, 2026, Coinbase suffered a nearly 10-hour global trading outage that coincided with a regional AWS disruption. The $40B company, which processes $5.2 trillion annually, confirmed that its core matching engine was intentionally confined to a single AWS Availability Zone (AZ) to meet the low-latency demands of its Raft-based replicated cluster. However, Coinbase had no automated cross-zone failover. The incident interrupted approximately $7 billion in financial activity and required an emergency code change and manual intervention to restore service. It follows a similar 3-hour global trading disruption in October 2025, caused by AWS DynamoDB issues, after which Coinbase had committed to reviewing its regional deployment strategy. The author argues that Coinbase's infrastructure resilience falls short of both its scale and its earlier commitments.

Key takeaway

If you lead engineering for a high-value financial platform, prioritize automated cross-zone failover, even when core services must be co-located in a single AZ for latency. Your infrastructure strategy should explicitly address the risk of an AZ outage with pre-planned, tested recovery mechanisms. Relying on manual intervention or emergency code changes during an incident is unacceptable for systems handling trillions of dollars. Run regular failover drills to validate resilience, and make sure post-outage commitments translate into concrete architectural improvements.
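
To make "automated failover" concrete, here is a minimal sketch of a controller that promotes a standby zone after several consecutive failed health checks. It is an illustration only, not Coinbase's design: the class name, the zone names, and the threshold of three failures are all assumptions, and a real system would also need to fence the failed zone and move state safely.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverController:
    """Hypothetical controller: promote a standby zone after N consecutive failures."""
    active_zone: str
    standby_zones: list[str]
    failure_threshold: int = 3  # assumed value; tune per service
    consecutive_failures: int = 0
    history: list[str] = field(default_factory=list)

    def record_health_check(self, healthy: bool) -> str:
        """Feed one health-check result; return the zone that should serve traffic."""
        if healthy:
            # Any success resets the failure streak.
            self.consecutive_failures = 0
            return self.active_zone
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and self.standby_zones:
            failed = self.active_zone
            self.active_zone = self.standby_zones.pop(0)
            self.consecutive_failures = 0
            self.history.append(f"{failed} -> {self.active_zone}")
        return self.active_zone


# A simulated failover drill: one healthy check, then three failures in a row.
controller = FailoverController(active_zone="zone-a", standby_zones=["zone-b", "zone-c"])
for result in [True, False, False, False]:
    serving_zone = controller.record_health_check(result)
```

Running the same simulated sequence on a schedule is one simple form of the failover drill described above: the decision path is exercised regularly instead of for the first time during an incident.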

Key insights

Critical financial services require automated cross-zone failover despite low-latency single-AZ design choices.

Method

Recovery required three steps: an emergency code change to remove a startup assumption, the creation of a new node group, and a carefully sequenced restoration of quorum.
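
Why losing one zone can cost a Raft cluster its quorum comes down to simple arithmetic: the cluster needs a strict majority of its nodes to make progress. The sketch below checks whether a given node placement keeps a majority after one zone is lost. The three-node cluster, node names, and zone names are hypothetical; the source does not state Coinbase's actual node count or layout.

```python
def quorum_size(cluster_size: int) -> int:
    """A Raft cluster needs a strict majority of its nodes to commit entries."""
    return cluster_size // 2 + 1


def survives_zone_loss(placement: dict[str, str], lost_zone: str) -> bool:
    """placement maps node name -> zone; True if a majority remains after the loss."""
    survivors = sum(1 for zone in placement.values() if zone != lost_zone)
    return survivors >= quorum_size(len(placement))


# Hypothetical placements for a three-node cluster.
single_az = {"n1": "zone-a", "n2": "zone-a", "n3": "zone-a"}
spread = {"n1": "zone-a", "n2": "zone-b", "n3": "zone-c"}

single_az_ok = survives_zone_loss(single_az, "zone-a")  # no nodes survive
spread_ok = survives_zone_loss(spread, "zone-a")        # two of three survive
```

Spreading nodes across zones preserves quorum but puts cross-zone round trips on the commit path, which is the latency cost the single-AZ design was chosen to avoid. A single-AZ cluster therefore needs a separate, pre-built path for re-forming quorum in another zone.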

Best for: CTO, DevOps Engineer, Software Engineer, VP of Engineering/Data

Editorial summary, takeaway, and curation by AIssential. Original article published by The Pragmatic Engineer.