EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems

Source: cs.LG updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Data Science & Analytics) · Depth: Expert, extended

Summary

EventADL is presented as the first open-box anomaly detection and localization (ADL) framework designed specifically for event data in cloud-based service systems, filling a gap left by existing ADL solutions, which focus mainly on metric and log data. Its design is informed by a systematic analysis of 520 real-world incidents. In that analysis, anomalies manifested through Event Type (21%), Event Value (68%), and Event Frequency (67%); the shares sum to more than 100%, so a single incident can show more than one manifestation. Root causes involved a single intervention in 32% of incidents and multiple interventions in 68%. EventADL operates in three phases: offline training, which learns Event Semantic Patterns (ESPs) and Event Frequency Patterns (EFPs); online anomaly detection, which compares incoming events against these patterns; and root cause localization, which uses an Intervention Graph and a time-aware random walk. Evaluated on three real cloud service systems and two real-world incidents, EventADL achieved F1-scores of at least 90% for anomaly detection and 100% top-3 accuracy in root cause localization, outperforming existing methods.

Key takeaway

For research scientists developing cloud system observability tools, EventADL demonstrates an interpretable approach to event-based anomaly detection and root cause localization. Consider integrating explicit semantic and frequency pattern learning, together with intervention-graph-based causal tracing, into next-generation ADL frameworks. The reported results suggest this can improve diagnostic accuracy and reduce manual investigation time for complex incidents.

Key insights

EventADL provides open-box anomaly detection and root cause localization for cloud event data, leveraging semantic and frequency patterns.
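
To make the pattern idea concrete, here is a minimal sketch of open-box detection over the three manifestation categories named in the summary (Event Type, Event Value, Event Frequency). It is not the paper's ESP/EFP algorithm, whose details are not given here. The event representation (`(event_type, numeric_value)` tuples grouped into windows), the min/max value range, the mean plus or minus `k` standard deviations frequency rule, and the names `learn_patterns` and `detect` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def learn_patterns(train_windows):
    """Offline phase (illustrative): summarise normal behaviour per event type.

    train_windows: list of windows; each window is a list of
    (event_type, numeric_value) tuples. This format is an assumption.
    """
    window_counts = [Counter(etype for etype, _ in w) for w in train_windows]
    values = defaultdict(list)
    for window in train_windows:
        for etype, value in window:
            values[etype].append(value)

    patterns = {}
    for etype, vals in values.items():
        # Counter returns 0 for windows where the type did not occur.
        freq = [c[etype] for c in window_counts]
        patterns[etype] = {
            "value_range": (min(vals), max(vals)),  # stand-in for a semantic pattern
            "freq_mean": mean(freq),                # stand-in for a frequency pattern
            "freq_std": pstdev(freq),
        }
    return patterns


def detect(window, patterns, k=3.0):
    """Online phase (illustrative): return human-readable reasons, empty if normal."""
    reasons = []
    counts = Counter(etype for etype, _ in window)
    for etype, value in window:
        p = patterns.get(etype)
        if p is None:
            reasons.append(("type", etype, "event type never seen in training"))
            continue
        lo, hi = p["value_range"]
        if not lo <= value <= hi:
            reasons.append(("value", etype, f"value {value} outside [{lo}, {hi}]"))
    for etype, p in patterns.items():
        # Floor the spread at 1.0 so a constant training count still leaves slack.
        tolerance = k * max(p["freq_std"], 1.0)
        if abs(counts[etype] - p["freq_mean"]) > tolerance:
            reasons.append(
                ("frequency", etype,
                 f"count {counts[etype]} vs. learned mean {p['freq_mean']:.1f}")
            )
    return reasons


# Hypothetical event names, used only to exercise the sketch.
train = [
    [("restart", 1), ("scale", 2)],
    [("restart", 1), ("scale", 3)],
    [("restart", 1), ("scale", 2)],
]
patterns = learn_patterns(train)
normal_reasons = detect([("restart", 1), ("scale", 2)], patterns)
anomalous_reasons = detect(
    [("scale", 9), ("evict", 0)] + [("restart", 1)] * 6, patterns
)
```

Because every flagged window carries the violated pattern as an explicit reason, an operator can see why it was flagged, which is the sense of "open-box" this sketch aims to convey.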

Principles

Method

EventADL learns Event Semantic Patterns (ESPs) and Event Frequency Patterns (EFPs) offline. Online, it detects deviations from these patterns and uses an Intervention Graph with a time-aware random walk for root cause localization.
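
The localization step can be sketched as a random walk that starts at anomalous events and moves toward candidate interventions, preferring those closer in time. The sketch below is a stand-in, not the paper's method: it swaps in a deterministic random-walk-with-restart power iteration (personalized-PageRank style) instead of sampled walks, and it assumes an exponential time decay for the "time-aware" part. The graph encoding, the parameters `tau` and `restart`, the function name `rank_root_causes`, and the example node names are all hypothetical.

```python
import math


def rank_root_causes(edges, timestamps, anomalies, tau=60.0, restart=0.15, iters=100):
    """Rank candidate interventions by stationary visit probability.

    edges:      node -> list of nodes that may have caused it (effect -> cause)
    timestamps: node -> time in seconds (every node must appear here)
    anomalies:  non-empty set of nodes where the walk (re)starts
    """
    nodes = sorted(timestamps)

    def weight(effect, cause):
        # Assumed time-awareness: causes closer before the effect weigh more;
        # a "cause" that happens after its effect gets zero weight.
        gap = timestamps[effect] - timestamps[cause]
        return math.exp(-gap / tau) if gap >= 0 else 0.0

    # Row-normalised transition probabilities.
    trans = {}
    for u in nodes:
        w = {v: weight(u, v) for v in edges.get(u, [])}
        total = sum(w.values())
        trans[u] = {v: x / total for v, x in w.items()} if total > 0 else {}

    start = {n: (1.0 / len(anomalies) if n in anomalies else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iters):
        nxt = {n: restart * start[n] for n in nodes}
        for u in nodes:
            if trans[u]:
                for v, p in trans[u].items():
                    nxt[v] += (1 - restart) * score[u] * p
            else:
                # Dead end: the walker jumps back to the anomalous events.
                for n in nodes:
                    nxt[n] += (1 - restart) * score[u] * start[n]
        score = nxt

    candidates = [n for n in nodes if n not in anomalies]
    return sorted(candidates, key=lambda n: score[n], reverse=True)


# Hypothetical incident: one anomalous event and three earlier interventions.
timestamps = {"pod_crash": 100, "config_change": 95, "node_drain": 90, "deploy_v2": 40}
edges = {"pod_crash": ["config_change", "node_drain", "deploy_v2"]}
ranking = rank_root_causes(edges, timestamps, anomalies={"pod_crash"})
```

In this toy graph the intervention closest in time to the anomaly ranks first. With several interventions linked to the same anomaly, as in the 68% of incidents with multiple interventions, a top-k cut of `ranking` would be the natural output.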

Best for: Research Scientist, AI Scientist, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.