Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage

2026-06-16 · Source: InfoQ · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, short

Summary

Coinbase published a detailed postmortem for its May 7, 2026, outage, revealing that a localized AWS cooling failure in the US-East-1 region escalated into a multi-hour disruption that halted nearly all trading. The initial AWS thermal event took EC2 instances and EBS volumes offline. Coinbase's investigation found that its Raft-based matching engine cluster ran within a single AWS Cluster Placement Group and lacked automated failover. This design, optimized for ultra-low latency, lost quorum when three of its five nodes went down. Additionally, Kafka workloads for event streaming became stranded, creating backlogs and delaying recovery. Together, these issues turned a localized cloud problem into a platform-wide outage, showing that architectural assumptions, not just the underlying cloud infrastructure, determine real-world availability. Coinbase plans to add automated cross-zone recovery and to improve its messaging infrastructure.
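
The quorum loss described above follows from standard Raft majority arithmetic: a cluster can only make progress while more than half of its voting members are reachable. The sketch below is a minimal illustration of that rule, not Coinbase's code; the function names `majority` and `has_quorum` are hypothetical.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy: int) -> bool:
    """True if the healthy voters can still elect a leader and commit entries."""
    return healthy >= majority(cluster_size)


# Figures from the summary: a five-node cluster with three nodes down.
cluster_size = 5
failed = 3
healthy = cluster_size - failed

print(majority(cluster_size))             # 3
print(has_quorum(cluster_size, healthy))  # False
```

A five-node cluster tolerates two failures, so the third lost node is the one that halts the cluster.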

Key takeaway

MLOps engineers designing high-performance, low-latency systems must critically evaluate the trade-off between speed and resilience. Relying on single-zone cluster placement for performance can create a single point of failure, as seen with Coinbase's Raft-based matching engine. Prioritize automated cross-zone failover and robust messaging infrastructure to speed recovery from inevitable cloud infrastructure failures, so that systems stay available during localized disruptions.

Key insights

Outages often stem from unexpected interactions between individually manageable failures and architectural assumptions.

Principles

Performance optimization can compromise resilience.
Cloud deployment doesn't guarantee resilience.
Hidden dependencies amplify failure impact.

Method

Recovery involved emergency code changes, manual cluster reconstruction, and rebalancing Kafka partitions to restore quorum and data flow.

In practice

Implement automated cross-zone recovery.
Improve quorum restoration procedures.
Expand disaster recovery testing.
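
As one way to make these recommendations testable, a placement audit can flag any zone whose loss alone would break quorum. This is a hedged sketch under stated assumptions: the node and zone names, the example layouts, and the helper `zones_that_break_quorum` are all hypothetical, and the check ignores the latency cost that motivates single-zone placement in the first place.

```python
from collections import Counter


def zones_that_break_quorum(placement: dict[str, str]) -> list[str]:
    """Return zones whose failure alone leaves fewer than a majority of voters.

    `placement` maps a node name to the zone it runs in (hypothetical names).
    """
    total = len(placement)
    needed = total // 2 + 1
    per_zone = Counter(placement.values())
    return sorted(zone for zone, count in per_zone.items() if total - count < needed)


# All five voters in one zone, similar in spirit to a single placement group.
single_zone = {f"node-{i}": "zone-a" for i in range(5)}

# The same five voters spread 2/2/1 across three zones (illustrative layout).
spread = {
    "node-0": "zone-a",
    "node-1": "zone-a",
    "node-2": "zone-b",
    "node-3": "zone-b",
    "node-4": "zone-c",
}

print(zones_that_break_quorum(single_zone))  # ['zone-a']
print(zones_that_break_quorum(spread))       # []
```

A check like this could run as part of disaster recovery testing. It only covers whole-zone loss, so a partial failure inside one zone, as in this incident, still needs separate failover handling.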

Topics

AWS Outage
Cloud Resilience
Distributed Systems
Raft Consensus
Kafka
Disaster Recovery

Best for: CTO, VP of Engineering/Data, Product Manager, Software Engineer, DevOps Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.