EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems
Summary
EventADL is the first open-box anomaly detection and localization (ADL) framework specifically designed for event data in cloud-based service systems. It addresses a gap where existing ADL solutions primarily focus on metric and log data. The framework's design is informed by a systematic analysis of 520 real-world incidents, which revealed that anomalies manifest through three non-exclusive channels, Event Type (21%), Event Value (68%), and Event Frequency (67%), and that root causes involve either a single intervention (32%) or multiple interventions (68%). EventADL operates in three phases: offline training to learn Event Semantic Patterns (ESPs) and Event Frequency Patterns (EFPs), online anomaly detection by comparing incoming events against these patterns, and root cause localization using an Intervention Graph and a time-aware random walk. Evaluated on three real cloud service systems and two real-world incidents, EventADL achieved F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization, outperforming existing methods.
Key takeaway
For research scientists developing cloud system observability tools, EventADL demonstrates an interpretable approach to event-based anomaly detection and root cause localization. Consider integrating explicit semantic and frequency pattern learning, alongside intervention graph-based causal tracing, into your next-generation ADL frameworks. Doing so can improve diagnostic accuracy and yield actionable insights, reducing manual investigation time for complex incidents.
Key insights
EventADL provides open-box anomaly detection and root cause localization for cloud event data, leveraging semantic and frequency patterns.
Principles
- Anomalies manifest across event type, value, and frequency.
- Root causes involve a single intervention or, more often, multiple interventions.
- Interpretability is crucial for effective incident response.
Method
EventADL learns Event Semantic Patterns (ESPs) and Event Frequency Patterns (EFPs) offline. Online, it detects deviations from these patterns and uses an Intervention Graph with a time-aware random walk for root cause localization.
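The offline/online split described above can be illustrated with a minimal sketch. It is not the paper's implementation: the event shape `(timestamp, event_type, event_value)`, the function names `learn_patterns` and `detect`, the fixed 60-second window, and the z-score style frequency check are all assumptions chosen for illustration. The semantic pattern here is simply the set of values seen per event type, and the frequency pattern is the mean and standard deviation of per-window counts.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical event shape: (timestamp_seconds, event_type, event_value)

def learn_patterns(events, window=60):
    """Offline phase: learn toy semantic and frequency patterns from normal events."""
    values = defaultdict(set)      # event type -> values seen during training
    counts = defaultdict(Counter)  # event type -> {window index: event count}
    for ts, etype, value in events:
        values[etype].add(value)
        counts[etype][int(ts // window)] += 1

    first = min(int(ts // window) for ts, _, _ in events)
    last = max(int(ts // window) for ts, _, _ in events)
    freq = {}
    for etype, per_window in counts.items():
        # Include windows with zero events so the mean is not inflated.
        series = [per_window.get(w, 0) for w in range(first, last + 1)]
        freq[etype] = (mean(series), pstdev(series))
    return {"values": dict(values), "freq": freq}

def detect(patterns, events, z_threshold=3.0):
    """Online phase: check one window of events against the learned patterns."""
    findings = []
    seen = Counter()
    for ts, etype, value in events:
        seen[etype] += 1
        if etype not in patterns["values"]:
            findings.append(("type", etype, value))       # unseen event type
        elif value not in patterns["values"][etype]:
            findings.append(("value", etype, value))      # unseen value for a known type
    for etype, (mu, sigma) in patterns["freq"].items():
        count = seen.get(etype, 0)
        # Floor sigma at 1.0 so perfectly regular training data does not
        # flag every one-event fluctuation (an assumption of this sketch).
        if abs(count - mu) > z_threshold * max(sigma, 1.0):
            findings.append(("frequency", etype, count))  # unusual count in this window
    return findings
```

Each finding names the manifestation that triggered it (type, value, or frequency), which keeps the detector's output inspectable in the open-box spirit described above.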
In practice
- Monitor Event Type, Event Value, and Event Frequency for comprehensive anomaly detection.
- Utilize Intervention Graphs to trace causal paths from anomalies to root causes.
- Employ time-aware random walks for automated root cause ranking.
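The graph-based ranking in the last two points can be sketched as a random walk with restart whose transition weights depend on anomaly timing. This is an illustrative assumption, not the paper's formulation: the function name `rank_root_causes`, the effect-to-cause edge direction, the exponential time-decay weight, and the `decay` and `restart` parameters are all hypothetical. The walk starts at the symptomatic node, only moves to neighbors that became anomalous no later than the current node, and prefers neighbors that are closer in time.

```python
import math

def rank_root_causes(edges, anomaly_time, start, decay=60.0, restart=0.15, steps=100):
    """Rank candidate root causes with a time-weighted random walk with restart.

    edges: node -> list of nodes that may have caused it (effect -> cause).
    anomaly_time: node -> timestamp (seconds) at which it became anomalous.
    """
    nodes = set(edges) | {m for ms in edges.values() for m in ms} | {start}

    def weight(src, dst):
        gap = anomaly_time[src] - anomaly_time[dst]
        # A cause must not be later than its effect; closer in time weighs more.
        return math.exp(-gap / decay) if gap >= 0 else 0.0

    transitions = {}
    for n in nodes:
        w = {m: weight(n, m) for m in edges.get(n, [])}
        total = sum(w.values())
        if total > 0:
            transitions[n] = {m: x / total for m, x in w.items() if x > 0}
        else:
            transitions[n] = {n: 1.0}  # no earlier candidate: the walk stays here

    score = {n: 0.0 for n in nodes}
    score[start] = 1.0
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        nxt[start] += restart  # total mass is 1, so this is the restart share
        for n, p in score.items():
            for m, t in transitions[n].items():
                nxt[m] += (1 - restart) * p * t
        score = nxt

    candidates = [n for n in nodes if n != start]
    return sorted(((n, score[n]) for n in candidates), key=lambda x: (-x[1], x[0]))
```

For example, with `edges = {"api": ["db", "cache"], "db": ["config-change"]}` and a `cache` node whose anomaly started after the `api` symptom, the walk never enters `cache`, and the probability mass accumulates on `config-change`, the earliest node on the path.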
Topics
- EventADL Framework
- Anomaly Detection
- Root Cause Localization
- Cloud Service Systems
- Event Data Analysis
Best for: Research Scientist, AI Scientist, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.