How Reddit Migrated Petabyte-Scale Kafka from EC2 to Kubernetes
Summary
Reddit's engineering team migrated its entire Apache Kafka fleet, more than 500 brokers holding over a petabyte of live data, from Amazon EC2 virtual machines to Kubernetes with zero downtime and no client application changes. The migration ran under strict constraints: Kafka had to stay available throughout, metadata had to be preserved, tightly coupled client connectivity had to keep working, and every step had to be reversible. The work proceeded in phases: introducing a DNS facade, reconfiguring broker IDs, running a mixed EC2/Kubernetes cluster with a forked Strimzi operator, gradually shifting data with Cruise Control, and finally migrating the control plane from ZooKeeper to KRaft. The result shows that large-scale infrastructure changes can be carried out with minimal risk when they are broken into small, reversible steps behind well-chosen abstraction layers. Key lessons include the value of abstraction, protecting logical state over physical infrastructure, and designing for reversibility so the team can move forward with confidence.
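To make the facade and small-step ideas concrete, here is a minimal toy sketch in Python. It is not Reddit's implementation and uses no real DNS, Strimzi, or Cruise Control API: the `DnsFacade` class, the `plan_batches` helper, and all hostnames are hypothetical names invented for illustration. It only models two properties named in the summary: clients keep using a stable logical name while the target behind it changes, and each change can be undone.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DnsFacade:
    """Toy stand-in for a DNS layer (hypothetical, for illustration only).

    Maps stable logical names, which clients use, to the address
    currently serving that name.
    """
    records: Dict[str, str] = field(default_factory=dict)
    history: List[Tuple[str, str]] = field(default_factory=list)

    def point(self, name: str, target: str) -> None:
        # Remember the previous target so this repoint can be undone later.
        if name in self.records:
            self.history.append((name, self.records[name]))
        self.records[name] = target

    def rollback(self) -> None:
        # Undo the most recent repoint, if there is one.
        if self.history:
            name, previous = self.history.pop()
            self.records[name] = previous

    def resolve(self, name: str) -> str:
        # Clients only ever see the logical name, never the physical host.
        return self.records[name]


def plan_batches(partitions: List[str], batch_size: int) -> List[List[str]]:
    """Split a list of partition moves into small batches.

    Small batches keep each step easy to verify and, if needed, revert.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        partitions[i:i + batch_size]
        for i in range(0, len(partitions), batch_size)
    ]
```

In this sketch, repointing a name from an assumed `ec2-broker-1:9092` to an assumed `k8s-broker-1:9092` changes what `resolve` returns without the caller changing the name it asks for, and `rollback` restores the earlier target. A real migration would additionally have to account for DNS caching, Kafka's advertised listeners, and replica synchronization, none of which this toy models.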
Key takeaway
Reddit migrated its petabyte-scale Apache Kafka fleet (500+ brokers) from EC2 to Kubernetes with zero downtime and no client application changes. It did so with a DNS facade, a forked Strimzi operator for mixed-cluster operation, and Cruise Control for incremental data shifting, and finished with a ZooKeeper-to-KRaft control plane migration. The approach gives MLOps teams a blueprint for modernizing critical, stateful data streaming infrastructure without service interruption.
Topics
- Apache Kafka
- Kubernetes
- Infrastructure Migration
- Strimzi
- KRaft
Best for: MLOps Engineer, CTO, VP of Engineering/Data, DevOps Engineer, Software Engineer, Data Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.