Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage
Summary
Coinbase published a detailed postmortem for its May 7, 2026, outage, revealing that a localized AWS cooling failure in the US-East-1 region escalated into a multi-hour disruption that halted nearly all trading. The initial AWS thermal event took EC2 instances and EBS volumes offline. Coinbase's investigation found that its Raft-based matching engine cluster ran within a single AWS Cluster Placement Group and lacked automated failover. This design, optimized for ultra-low latency, lost quorum when three of its five nodes went down. Additionally, Kafka workloads for event streaming became stranded, creating backlogs and delaying recovery. Together, these issues turned a localized cloud problem into a platform-wide outage, showing that architectural assumptions, not just the underlying cloud infrastructure, determine real-world availability. Coinbase plans to add automated cross-zone recovery and improve its messaging infrastructure.
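The quorum loss follows from standard Raft majority arithmetic: a cluster of n nodes needs floor(n/2) + 1 reachable nodes to elect a leader and commit writes. The sketch below is illustrative only and is not Coinbase's code; the function names are hypothetical.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_nodes: int) -> bool:
    """True if the healthy nodes can still form a Raft majority."""
    return healthy_nodes >= majority(cluster_size)


# Five-node cluster from the summary: three nodes down leaves two healthy.
cluster_size = 5
failed = 3
print(majority(cluster_size))                           # 3
print(has_quorum(cluster_size, cluster_size - failed))  # False
```

A five-node cluster tolerates two failures. The third failure drops it below the majority of three, so the cluster can no longer commit writes until membership is restored.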
Key takeaway
MLOps engineers designing high-performance, low-latency systems must critically evaluate the architectural trade-offs between speed and resilience. Relying on single-zone cluster placement for performance can create a single point of failure, as seen with Coinbase's Raft-based matching engine. Prioritize automated cross-zone failover and robust messaging infrastructure to speed recovery from inevitable cloud infrastructure failures, so that systems stay available during localized disruptions.
Key insights
Outages often stem from unexpected interactions between individually manageable failures and architectural assumptions.
Principles
- Performance optimization can compromise resilience.
- Cloud deployment doesn't guarantee resilience.
- Hidden dependencies amplify failure impact.
Method
Recovery involved emergency code changes, manual cluster reconstruction, and rebalancing Kafka partitions to restore quorum and data flow.
In practice
- Implement automated cross-zone recovery.
- Improve quorum restoration procedures.
- Expand disaster recovery testing.
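One way to make these recommendations testable is to check, for a given node placement, whether losing any single zone would break quorum. The following is a minimal sketch under stated assumptions: the node and zone names are hypothetical, and the source does not describe the layout Coinbase plans to adopt.

```python
from collections import Counter


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def zones_that_break_quorum(node_zones: dict[str, str]) -> list[str]:
    """Return the zones whose loss alone would leave the cluster without quorum.

    node_zones maps a node name to the zone it runs in (hypothetical names).
    """
    total = len(node_zones)
    per_zone = Counter(node_zones.values())
    return sorted(
        zone for zone, count in per_zone.items()
        if total - count < majority(total)
    )


# All five nodes in one zone, similar in spirit to a single placement group.
single_zone = {f"node-{i}": "zone-a" for i in range(5)}

# The same five nodes spread 2/2/1 across three zones.
spread = {
    "node-0": "zone-a", "node-1": "zone-a",
    "node-2": "zone-b", "node-3": "zone-b",
    "node-4": "zone-c",
}

print(zones_that_break_quorum(single_zone))  # ['zone-a']
print(zones_that_break_quorum(spread))       # []
```

A check like this could run in a disaster recovery test suite. It captures only the quorum side of the trade-off: spreading nodes across zones adds network latency between them, which is the cost the single-placement-group design was avoiding.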
Topics
- AWS Outage
- Cloud Resilience
- Distributed Systems
- Raft Consensus
- Kafka
- Disaster Recovery
Best for: CTO, VP of Engineering/Data, Product Manager, Software Engineer, DevOps Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.