Presentation: Week-Long Outage: Lifelong Lessons
Summary
Molly Struve, a Staff Site Reliability Engineer at Netflix, details a six-day outage at her previous company, Kenna Security, caused by a critical bug in an Elasticsearch 2 to 5 upgrade. This incident, which severely impacted Kenna's core cybersecurity platform, led to CPU spikes, node crashes, and prolonged service unavailability. The resolution came from an Elastic senior engineer who identified and patched a bug in the Elasticsearch source code. Struve emphasizes both technical lessons, such as the necessity of rollback plans, FMEAs, and regular performance testing with shadow traffic, and crucial human elements like widening the incident response circle early, the importance of team cohesion, and supportive leadership that acts as a defender and cheerleader.
Key takeaway
For CTOs and VPs of Engineering overseeing critical infrastructure, this account underscores that robust incident preparedness extends beyond technical fixes. Prioritize a culture of psychological safety in which teams feel empowered to ask for help early and leaders act as unwavering defenders. Regularly exercising rollback plans and running continuous performance tests are non-negotiable: they limit the impact of inevitable failures and help turn incidents into foundational learning experiences.
Key insights
Outages are critical learning opportunities, and their lessons cover both technical preparedness and human support.
Principles
- Always have a rollback plan.
- Performance test regularly.
- Widen your incident circle early.
Method
Conduct a pre-mortem or a Failure Mode and Effects Analysis (FMEA) before major changes. Use shadow traffic or long-running canaries for continuous performance testing. Exercise rollback mechanisms regularly to confirm they still work.
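As a rough illustration of the FMEA step, the sketch below scores failure modes with the conventional risk priority number (severity times occurrence times detection, each on a 1 to 10 scale) and ranks them. The class name, function name, and the example failure modes and scores are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a simple FMEA worksheet (illustrative structure only)."""
    description: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (certain to be caught) to 10 (unlikely to be caught)

    def __post_init__(self):
        # Reject scores outside the conventional 1-10 range.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: severity * occurrence * detection."""
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


if __name__ == "__main__":
    # Hypothetical failure modes and scores for a datastore upgrade.
    worksheet = [
        FailureMode("Index mapping incompatible with new version", 6, 3, 2),
        FailureMode("New version spikes CPU under production query load", 9, 4, 8),
        FailureMode("Rollback path is untested and fails when needed", 10, 3, 7),
    ]
    for mode in rank_failure_modes(worksheet):
        print(f"{mode.rpn:4d}  {mode.description}")
```

The highest-scoring rows are the ones to mitigate before the change ships, for example by adding a test that would catch the failure earlier and so lower its detection score.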
In practice
- Use FMEA for de-risking large changes.
- Run shadow traffic against new clusters.
- Exercise rollback plans quarterly.
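The shadow-traffic item above can be sketched as follows, assuming the current and new clusters are each reachable through a plain callable. The names `mirror_request`, `primary`, `shadow`, and `record` are hypothetical. A production setup would normally send the shadow call asynchronously so it cannot add latency for users; this simplified version calls it inline.

```python
import time


def mirror_request(request, primary, shadow, record):
    """Serve from the primary backend and replay the request on the shadow.

    The caller only ever receives the primary's response. Shadow failures
    are swallowed and recorded so the new cluster can be evaluated safely.
    """
    start = time.perf_counter()
    response = primary(request)
    primary_seconds = time.perf_counter() - start

    shadow_seconds = None
    shadow_ok = False
    try:
        start = time.perf_counter()
        shadow(request)
        shadow_seconds = time.perf_counter() - start
        shadow_ok = True
    except Exception:
        # A failing shadow must never affect the user-facing response.
        pass

    record({
        "primary_seconds": primary_seconds,
        "shadow_seconds": shadow_seconds,
        "shadow_ok": shadow_ok,
    })
    return response


if __name__ == "__main__":
    samples = []
    # Stand-in backends; real ones would query the old and new clusters.
    result = mirror_request(
        "status:open",
        primary=lambda q: f"old cluster results for {q}",
        shadow=lambda q: f"new cluster results for {q}",
        record=samples.append,
    )
    print(result)
    print(samples)
```

Comparing the recorded timings and failure counts over a long enough window shows how the new cluster behaves under real query patterns before any user depends on it.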
Topics
- Elasticsearch Upgrade
- System Outage
- Rollback Planning
- Performance Testing
- Incident Management
Best for: CTO, VP of Engineering/Data, MLOps Engineer, DevOps Engineer, Software Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.