How Reddit Migrated Petabyte-Scale Kafka from EC2 to Kubernetes

· Source: ByteByteGo Newsletter · Field: Technology & Digital — Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, medium

Summary

Reddit's engineering team migrated its entire Apache Kafka fleet, more than 500 brokers holding over a petabyte of live data, from Amazon EC2 virtual machines to Kubernetes with zero downtime and no changes to client applications. The migration ran under strict constraints: Kafka had to stay available throughout, cluster metadata had to be preserved, tightly coupled client connectivity had to be accommodated, and every step had to be reversible. The work proceeded in phases: introducing a DNS facade, reconfiguring broker IDs, running a mixed EC2/Kubernetes cluster with a forked Strimzi operator, gradually shifting data with Cruise Control, and finally migrating the control plane from ZooKeeper to KRaft. The result shows that large-scale infrastructure changes can be carried out with minimal risk when they are broken into small, reversible steps behind well-placed abstraction layers. Key lessons include the value of abstraction, protecting logical state over physical infrastructure, and designing for reversibility so that each step can be taken with confidence.
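The "small, reversible steps" idea behind the gradual data shift can be illustrated with a minimal sketch. This is not Reddit's tooling and does not use the Cruise Control API. It is a plain-Python illustration under assumed names (`Move`, `plan_batches`, and the broker IDs and topic name are all hypothetical) of how replica moves from old brokers to new brokers might be planned in bounded batches, each with an obvious rollback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    """One replica relocation (hypothetical type, for illustration only)."""
    topic: str
    partition: int
    source_broker: int
    target_broker: int

    def rollback(self) -> "Move":
        # Swapping source and target gives the move that undoes this one.
        return Move(self.topic, self.partition, self.target_broker, self.source_broker)


def plan_batches(assignments, old_brokers, new_brokers, batch_size):
    """Plan moves of replicas off old_brokers, in batches of at most batch_size.

    assignments: dict mapping (topic, partition) -> list of broker IDs with a replica.
    old_brokers: set of broker IDs being drained (e.g. the EC2 brokers).
    new_brokers: list of broker IDs receiving replicas (e.g. the Kubernetes brokers).
    Targets are chosen round-robin, skipping brokers that already hold
    (or have just been assigned) a replica of the same partition.
    """
    moves = []
    cursor = 0
    for (topic, partition), replicas in sorted(assignments.items()):
        taken = set(replicas)
        for broker in replicas:
            if broker not in old_brokers:
                continue  # replica already lives on a new broker
            for offset in range(len(new_brokers)):
                candidate = new_brokers[(cursor + offset) % len(new_brokers)]
                if candidate not in taken:
                    break
            else:
                raise ValueError(f"no free target broker for {topic}-{partition}")
            cursor = (cursor + offset + 1) % len(new_brokers)
            taken.add(candidate)
            moves.append(Move(topic, partition, broker, candidate))
    # Slice the flat move list into bounded batches that can be applied one at a time.
    return [moves[i:i + batch_size] for i in range(0, len(moves), batch_size)]


# Example: two partitions on three old brokers, drained onto two new brokers.
assignments = {("events", 0): [1, 2], ("events", 1): [2, 3]}
batches = plan_batches(assignments, old_brokers={1, 2, 3}, new_brokers=[101, 102], batch_size=3)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} moves")
```

Keeping each batch small bounds how much data is in flight at once, and because every move carries its own rollback, a batch that causes trouble can be undone before the next one starts.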

Key takeaway

Reddit migrated its petabyte-scale Apache Kafka fleet (500+ brokers) from EC2 to Kubernetes with zero downtime and no client application changes. It did so using a DNS facade, a forked Strimzi operator for mixed-cluster operation, and Cruise Control for incremental data shifting, and finished with a ZooKeeper-to-KRaft control plane migration. The approach gives MLOps teams a template for modernizing critical, stateful data streaming infrastructure without service interruption.

Topics

Best for: MLOps Engineer, CTO, VP of Engineering/Data, DevOps Engineer, Software Engineer, Data Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.