SE Radio 719: Birol Yildiz on Building an Agentic AI SRE
Summary
Birol Yildiz, CEO and co-founder of iLert, details the development of their AI SRE, an autonomous agent designed for production incident response. The system targets novel incidents, where traditional runbooks fall short because no one wrote the steps in advance, by relying on reasoning models. The AI SRE's architecture comprises four layers: an orchestration layer that routes requests and abstracts model providers, a knowledge layer that uses plain text memory and agentic search (e.g., grep, jq) instead of vector databases, an evaluation framework based on replaying recorded live investigations, and a human-in-the-loop constraint layer. The design evolved from an early browser-based concept and now builds on reasoning models and the Model Context Protocol, which was introduced by Anthropic and is now supported by Google and OpenAI. iLert aims for the AI SRE to complete root cause analysis within four minutes, significantly faster than manual processes.
Key takeaway
For MLOps Engineers or AI Architects building autonomous agents: own the model's context and avoid overly prescriptive frameworks like LangChain, so that you keep full control over what the model receives as input. Trust the reasoning model to determine its own trajectory, and give it tools rather than rigid workflows. Start with read-only agent permissions, and build an evaluation framework on recorded investigations so you can iterate safely towards greater autonomy, especially for critical tasks like incident response.
Key insights
AI agents excel in incident response by reasoning through novel problems, surpassing rigid runbook limitations.
Principles
- Own your context completely.
- Trust reasoning models' capabilities.
- Agentic search beats vector databases.
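A minimal sketch of what agentic search over plain text memory could look like, assuming the knowledge base is a set of text notes and the agent is given a grep-like tool. The `NOTES` contents and the `grep_notes` function are hypothetical illustrations, not iLert's implementation.

```python
import re

# Hypothetical plain-text memory: file name -> contents.
NOTES = {
    "postgres.md": "Primary db runs on pg-1.\nFailover to pg-2 takes about 30s.\n",
    "checkout.md": "Checkout service depends on postgres and redis.\n"
                   "Alert: checkout_latency_high\n",
}


def grep_notes(pattern, notes=NOTES, ignore_case=True):
    """Return (file, line_number, line) for every line matching the regex.

    This is the kind of tool an agent can call repeatedly, refining the
    pattern itself, instead of querying a vector index.
    """
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    hits = []
    for name in sorted(notes):
        for number, line in enumerate(notes[name].splitlines(), start=1):
            if rx.search(line):
                hits.append((name, number, line))
    return hits
```

The agent decides which pattern to try next based on earlier hits, for example calling `grep_notes("postgres")` first and then `grep_notes(r"pg-\d")` to follow the trail to specific hosts.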
Method
Build agents with an orchestration layer, a plain text knowledge base, an evaluation framework that replays recorded investigations, and human-in-the-loop constraints.
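The replay-based evaluation in this method can be sketched roughly as follows. This assumes a recording stores each tool call with its arguments and output, plus the root cause a human confirmed. The recording format, `ReplayTools`, `evaluate`, and `toy_agent` are all hypothetical names chosen for illustration.

```python
import json


class ReplayTools:
    """Serves recorded tool outputs instead of touching live systems."""

    def __init__(self, recording):
        self._responses = {
            (call["tool"], json.dumps(call["args"], sort_keys=True)): call["output"]
            for call in recording["tool_calls"]
        }
        self.misses = []  # calls the agent made that were never recorded

    def call(self, tool, **args):
        key = (tool, json.dumps(args, sort_keys=True))
        if key not in self._responses:
            self.misses.append(key)
            return "NO RECORDED DATA"
        return self._responses[key]


def evaluate(agent, recording):
    """Run the agent against a recording and check its conclusion."""
    tools = ReplayTools(recording)
    conclusion = agent(recording["alert"], tools)
    passed = recording["expected_root_cause"].lower() in conclusion.lower()
    return {"passed": passed, "misses": len(tools.misses)}


# Hypothetical recording of one past investigation.
RECORDING = {
    "alert": "checkout_latency_high",
    "tool_calls": [
        {"tool": "get_logs", "args": {"service": "checkout"},
         "output": "ERROR connection refused to pg-1"},
    ],
    "expected_root_cause": "connection refused",
}


def toy_agent(alert, tools):
    """Stand-in for a real reasoning-model agent, used only for the demo."""
    logs = tools.call("get_logs", service="checkout")
    if "connection refused" in logs:
        return "Root cause: database connection refused"
    return "Root cause: unknown"


result = evaluate(toy_agent, RECORDING)
```

Because the tools return recorded data, the same investigation can be rerun after every prompt or model change without touching production. Counting misses shows when a changed agent wanders outside what the recording covers.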
In practice
- Use agentic search (grep, jq) for knowledge.
- Record live investigations for evaluation.
- Start with read-only agent permissions.
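One way to start with read-only permissions is to gate every tool call through a small registry that knows which tools mutate state. The sketch below is an assumption about how such a gate might be structured. `Tool`, `ToolGate`, and the `approved_by` parameter are hypothetical and not taken from the episode.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[..., str]
    read_only: bool


class ToolGate:
    """Lets read-only tools through; blocks or escalates everything else."""

    def __init__(self, tools, allow_writes=False):
        self._tools = {tool.name: tool for tool in tools}
        self._allow_writes = allow_writes

    def call(self, name, approved_by=None, **kwargs):
        tool = self._tools[name]
        if not tool.read_only:
            if not self._allow_writes:
                raise PermissionError(f"{name} is disabled in read-only mode")
            if approved_by is None:
                raise PermissionError(f"{name} needs human approval")
        return tool.run(**kwargs)


# Hypothetical tools for the demo.
TOOLS = [
    Tool("get_status", lambda service: f"{service}: ok", read_only=True),
    Tool("restart", lambda service: f"restarted {service}", read_only=False),
]

gate = ToolGate(TOOLS)  # read-only by default
status = gate.call("get_status", service="api")
```

Switching to `ToolGate(TOOLS, allow_writes=True)` later keeps the human in the loop: a write such as `restart` still fails unless a caller passes `approved_by`.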
Topics
- AI Agents
- Incident Response
- Reasoning Models
- Agentic Search
- Model Context Protocol
- Evaluation Frameworks
Best for: AI Architect, AI Engineer, MLOps Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Software Engineering Radio - the podcast for professional software developers.