AWS Distinguished Eng: Learnings From 3000 Incidents And How Engineering Is Changing | Marc Brooker
Summary
AWS Distinguished Engineer Marc Brooker shares insights drawn from analyzing 3,000 cloud system post-mortems and 15 years of on-call experience. He argues that staying hands-on matters because it keeps technical judgment accurate, not because of the volume of code it produces. Brooker explains how customer feedback and technical trends, such as block storage becoming a default backend, drive innovation, citing Aurora Serverless and Aurora DSQL as examples. He cautions against the uncritical use of caching because it can lead to "metastable failures." He advocates a robust post-mortem culture that asks "why" at multiple levels to find broad organizational and technological fixes, not just proximal causes. The discussion also covers how AI is set to transform software engineering careers, and urges senior engineers to keep adapting and to return to hands-on building.
Key takeaway
MLOps engineers and AI architects designing resilient cloud systems should prioritize hands-on operational experience and cultivate a rigorous post-mortem culture. Investigate incidents beyond their proximal causes to identify systemic issues and strategic improvements, rather than just patching symptoms. Be wary of caching's potential to cause metastable failures. Keep adapting your skills to the new AI-powered development tools so that your expertise stays grounded in current practice.
Key insights
Hands-on experience and deep understanding of system failures are paramount for effective distributed systems engineering.
Principles
- Hands-on experience prevents wrong opinions.
- Post-mortems must address multi-level causes.
- Caching introduces metastable failure risks.
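To make the caching principle concrete, here is a minimal toy model of how a cache can create two stable operating states under the same load. It is an illustrative sketch, not something from the talk: the function name, every parameter value, and the rule that an overloaded backend wastes most of its work are all assumptions chosen to show the effect.

```python
def simulate(initial_hit_rate, steps=500, offered=1000.0,
             capacity=100.0, expiry=0.05):
    """Toy model of a cache in front of a backend (all values hypothetical).

    offered  : requests per step arriving at the cache
    capacity : requests per step the backend can serve when not overloaded
    expiry   : fraction of cached entries that expire each step
    Returns the cache hit rate after `steps` steps.
    """
    h = initial_hit_rate
    for _ in range(steps):
        misses = offered * (1.0 - h)  # requests that fall through to the backend
        if misses <= capacity:
            goodput = misses  # backend keeps up; every miss is served
        else:
            # Assumption: an overloaded backend wastes work on requests that
            # time out, so useful throughput shrinks as overload grows.
            goodput = capacity * (capacity / misses)
        # Entries expire; only successful backend responses refill the cache.
        h = h * (1.0 - expiry) + goodput / offered
        h = min(1.0, max(0.0, h))
    return h


if __name__ == "__main__":
    print("started warm:", round(simulate(0.95), 3))
    print("started cold:", round(simulate(0.0), 3))
```

With these assumed numbers, a run that starts warm settles at a hit rate above 0.9, while a run that starts cold (for example after a restart or cache flush) stays below 0.3, even though load and backend capacity are identical in both runs. The system does not recover by itself once the cache is empty, which is the property the term "metastable failure" refers to.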
Method
A good post-mortem involves deeply understanding "what happened," then iteratively asking "why" at multiple levels (code, testing, social processes) to identify concrete, broad fixes and extract patterns for new tools or services.
In practice
- Prioritize customer-driven problem solving.
- Automate repetitive on-call tasks.
- Document design decisions for future teams.
Topics
- Distributed Systems
- Post-Mortem Analysis
- Cloud Operations
- Caching Strategies
- AI in Software Engineering
- Career Development
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Software Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by The Peterman Post.