Presentation: Week-Long Outage: Lifelong Lessons

2026-04-28 · Source: InfoQ · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure, Site Reliability Engineering · Depth: Intermediate, extended

Summary

Molly Struve, a Staff Site Reliability Engineer at Netflix, details a six-day outage at her previous company, Kenna Security, caused by a critical bug encountered during an upgrade from Elasticsearch 2 to 5. The incident severely impacted Kenna's core cybersecurity platform, bringing CPU spikes, node crashes, and prolonged service unavailability. The resolution came from an Elastic senior engineer who identified and patched a bug in the Elasticsearch source code. Struve emphasizes technical lessons, such as the necessity of rollback plans, FMEAs, and regular performance testing with shadow traffic, alongside crucial human elements: widening the incident response circle early, team cohesion, and supportive leadership that acts as a defender and cheerleader.

Key takeaway

For CTOs and VPs of Engineering overseeing critical infrastructure, this account underscores that robust incident preparedness extends beyond technical fixes. You must prioritize establishing a culture of psychological safety where teams feel empowered to ask for help early and leaders act as unwavering defenders. Regularly exercising rollback plans and integrating continuous performance testing are non-negotiable if you want to mitigate the impact of inevitable failures and turn incidents into foundational learning experiences.

Key insights

Outages are critical learning opportunities that test both technical preparedness and human support.

Principles

Always have a rollback plan.
Performance test regularly.
Widen your incident circle early.

Method

Conduct a pre-mortem or a Failure Mode and Effects Analysis (FMEA) before major changes. Implement shadow traffic or long-running canaries for continuous performance testing. Exercise rollback mechanisms regularly to confirm they still work.
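To make the FMEA step concrete, here is a minimal sketch of how a team might score and rank failure modes before an upgrade. It uses the conventional FMEA risk priority number (severity × occurrence × detection, each rated 1 to 10). The failure modes and their scores are hypothetical illustrations, not figures from the talk, and the names `FailureMode` and `rank_failure_modes` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    """One row of an FMEA worksheet (hypothetical structure)."""
    description: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = caught immediately, 10 = effectively undetectable

    def __post_init__(self) -> None:
        # Reject ratings outside the conventional 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means address it first."""
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return failure modes sorted from highest to lowest risk priority."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative scores only; a real team would agree on these together.
UPGRADE_FMEA = [
    FailureMode("Index mapping incompatibility after upgrade", 6, 4, 3),
    FailureMode("Query latency regresses under production load", 8, 5, 6),
    FailureMode("Upgrade cannot be rolled back", 10, 3, 7),
]

if __name__ == "__main__":
    for mode in rank_failure_modes(UPGRADE_FMEA):
        print(f"{mode.rpn:>4}  {mode.description}")
```

The ranking is only as good as the scores feeding it; the value of the exercise lies mostly in the discussion that produces them, with the sorted list serving as a prompt for which mitigations to plan first.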

In practice

Use FMEA for de-risking large changes.
Run shadow traffic against new clusters.
Exercise rollback plans quarterly.
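The shadow-traffic idea in the list above can be sketched as follows: each real request is served by the current (primary) cluster, and a copy is replayed against the new (shadow) cluster whose response is measured but never returned to the user. This is a simplified, synchronous illustration; `serve_with_shadow`, `ShadowReport`, and the handler callables are hypothetical names, and a production setup would normally send the shadow copy asynchronously so it cannot slow down or break the user-facing path.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

Handler = Callable[[Any], Any]


@dataclass
class ShadowReport:
    """Accumulates comparison data between primary and shadow clusters."""
    primary_latencies: List[float] = field(default_factory=list)
    shadow_latencies: List[float] = field(default_factory=list)
    shadow_errors: int = 0
    mismatches: int = 0


def _timed(handler: Handler, request: Any) -> Tuple[Any, float]:
    """Call the handler and return its result with elapsed seconds."""
    start = time.perf_counter()
    result = handler(request)
    return result, time.perf_counter() - start


def serve_with_shadow(
    request: Any, primary: Handler, shadow: Handler, report: ShadowReport
) -> Any:
    """Serve from primary; replay the same request against shadow for comparison."""
    response, elapsed = _timed(primary, request)
    report.primary_latencies.append(elapsed)

    try:
        shadow_response, shadow_elapsed = _timed(shadow, request)
    except Exception:
        # A shadow failure must never affect the user-facing response.
        report.shadow_errors += 1
    else:
        report.shadow_latencies.append(shadow_elapsed)
        if shadow_response != response:
            report.mismatches += 1

    # Only the primary result is ever returned to the caller.
    return response
```

Comparing the two latency lists over a sustained period, along with the error and mismatch counts, gives an early signal of the kind of performance regression described in the talk before any user traffic depends on the new cluster.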

Topics

Elasticsearch Upgrade
System Outage
Rollback Planning
Performance Testing
Incident Management

Best for: CTO, VP of Engineering/Data, MLOps Engineer, DevOps Engineer, Software Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.