Coinbase Postmortem Reveals How a Localized AWS Failure Triggered a Multi-Hour Trading Outage

Source: InfoQ · Field: Technology & Digital (Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Advanced, short

Summary

Coinbase published a detailed postmortem for its May 7, 2026, outage, in which a localized AWS cooling failure in the us-east-1 region escalated into a multi-hour disruption that halted nearly all trading. The initial AWS thermal event took EC2 instances and EBS volumes offline. Coinbase's investigation found that its own architecture amplified the failure: the Raft-based matching engine cluster ran within a single AWS Cluster Placement Group and had no automated failover. This design, optimized for ultra-low latency, lost quorum when three of its five nodes went down. In addition, Kafka workloads used for event streaming were stranded, which created backlogs and delayed recovery. Together, these issues turned a localized cloud problem into a platform-wide outage, showing that architectural assumptions, not just the underlying cloud infrastructure, determine real-world availability. Coinbase plans to add automated cross-zone recovery and to improve its messaging infrastructure.
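
The quorum loss described above follows from Raft's majority rule: a cluster of n voting nodes can elect a leader and commit entries only while at least floor(n/2) + 1 of them are reachable. The following is a minimal sketch of that arithmetic in Python. The function names are hypothetical and it models only the majority count, not Coinbase's matching engine.

```python
def quorum_size(voters: int) -> int:
    """Smallest majority of a Raft cluster with `voters` voting members."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def has_quorum(voters: int, failed: int) -> bool:
    """True if the surviving members still form a majority."""
    if not 0 <= failed <= voters:
        raise ValueError("failed must be between 0 and voters")
    return voters - failed >= quorum_size(voters)


if __name__ == "__main__":
    print(quorum_size(5))    # 3
    print(has_quorum(5, 2))  # True: 3 survivors meet the majority of 3
    print(has_quorum(5, 3))  # False: 2 survivors fall short of 3
```

A five-node cluster tolerates two failures. Losing three, as in the incident, leaves two survivors, one short of the majority, so the cluster stops accepting writes until membership is restored or reconfigured.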

Key takeaway

For MLOps engineers designing high-performance, low-latency systems: weigh the trade-off between speed and resilience explicitly. Placing a cluster in a single zone for performance can create a single point of failure, as Coinbase's Raft-based matching engine showed. Prioritize automated cross-zone failover and resilient messaging infrastructure so that recovery from cloud infrastructure failures is faster and a localized disruption does not take the whole platform down.
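
One way to make the speed-versus-resilience trade-off concrete is to check whether a consensus cluster's placement would keep a majority if any single zone went down. This is a rough sketch under stated assumptions: the function name and the zone labels ("az-a", "az-b", "az-c") are illustrative, and real placement decisions also depend on latency budgets that this check ignores.

```python
from collections import Counter


def survives_single_zone_loss(node_zones: list[str]) -> bool:
    """Check whether a majority of voters remains after losing the fullest zone.

    `node_zones` holds one zone label per voting node (hypothetical labels).
    """
    voters = len(node_zones)
    if voters == 0:
        raise ValueError("node_zones must not be empty")
    majority = voters // 2 + 1
    # Worst case: the zone that hosts the most voters fails entirely.
    largest_zone = max(Counter(node_zones).values())
    return voters - largest_zone >= majority


if __name__ == "__main__":
    single_zone = ["az-a"] * 5
    spread = ["az-a", "az-a", "az-b", "az-b", "az-c"]
    skewed = ["az-a", "az-a", "az-a", "az-b", "az-c"]
    print(survives_single_zone_loss(single_zone))  # False
    print(survives_single_zone_loss(spread))       # True
    print(survives_single_zone_loss(skewed))       # False
```

Spreading voters across zones adds network latency to every consensus round, which is the cost a single-placement-group design avoids. The check only shows what that latency saving gives up in fault tolerance.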

Key insights

Outages often stem from unexpected interactions between individually manageable failures and architectural assumptions.

Method

Recovery involved emergency code changes, manual cluster reconstruction, and rebalancing Kafka partitions to restore quorum and data flow.

Best for: CTO, VP of Engineering/Data, Product Manager, Software Engineer, DevOps Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.