Yelp Achieves Zero-Downtime Upgrade of Over 1,000 Cassandra Nodes

· Source: InfoQ · Field: Technology & Digital — Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, quick

Summary

Yelp upgraded more than 1,000 Apache Cassandra nodes with zero downtime, offering a scalable blueprint for modernizing stateful systems. The company's Database Reliability Engineering team described how careful planning, phased execution, and automation made the upgrade possible. The core challenge was upgrading a live, highly available database without interrupting the production workloads that many of Yelp's core services depend on. The team used a rolling upgrade strategy, updating nodes incrementally while maintaining cluster availability and data consistency. Upgrading nodes in controlled batches and letting the cluster rebalance between steps reduced the risk of cascading failures, in line with established practice for Cassandra upgrades.

Key takeaway

For CTOs and VPs of Engineering managing critical data platforms, Yelp's Cassandra upgrade shows that zero-downtime modernization of large-scale stateful infrastructure is feasible. Prioritize thorough planning, phased execution, and significant investment in automation and observability to keep systems available through change and to reduce reliance on traditional maintenance windows.

Key insights

Zero-downtime upgrades for large-scale stateful systems are achievable through meticulous planning and phased execution.

Method

Execute rolling upgrades in controlled batches, allowing rebalancing and repair between steps. Monitor and validate each phase in real time with automation and continuous health checks.
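The batch-and-verify loop described above can be sketched roughly as follows. This is a minimal illustration, not Yelp's tooling: `rolling_upgrade`, `batches`, `upgrade_node`, and `cluster_healthy` are hypothetical names, and the upgrade and health-check steps are passed in as callables because the summary does not say how Yelp implemented them.

```python
import time
from typing import Callable, List, Sequence


def batches(nodes: Sequence[str], size: int) -> List[List[str]]:
    """Split the node list into consecutive batches of at most `size` nodes."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [list(nodes[i:i + size]) for i in range(0, len(nodes), size)]


def rolling_upgrade(
    nodes: Sequence[str],
    batch_size: int,
    upgrade_node: Callable[[str], None],   # hypothetical: upgrades one node
    cluster_healthy: Callable[[], bool],   # hypothetical: cluster-wide health check
    max_checks: int = 5,
    wait_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Upgrade nodes batch by batch, halting if the cluster does not recover."""
    upgraded: List[str] = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            upgrade_node(node)
            upgraded.append(node)
        # Give the cluster time to settle before touching the next batch.
        for _ in range(max_checks):
            if cluster_healthy():
                break
            sleep(wait_seconds)
        else:
            # No health check passed: stop here rather than risk a cascade.
            raise RuntimeError(f"cluster unhealthy after upgrading {batch}; halting")
    return upgraded
```

Called as `rolling_upgrade(node_list, 10, upgrade_node, cluster_healthy, wait_seconds=30)`, the loop never moves past a batch until a health check passes, which is what keeps a failure confined to one batch. The batch size and wait time shown here are placeholder values.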

Best for: CTO, VP of Engineering/Data, Data Engineer, DevOps Engineer, Consultant

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.