From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Summary
Prime Video has developed a graph-based anomaly detection system to identify services that load tests under-represent, since load tests often miss behaviors unique to real event traffic. The system, built on a graph autoencoder with a graph convolutional encoder (GCN-GAE), learns structural representations from directed, weighted microservice graphs at minute-level resolution. It flags anomalies by computing the cosine similarity between each service's load test and live event embeddings. The approach has shown early detection capability, identifying incident-related services documented in Correction of Error (CoE) reports up to three minutes before alarms fired. A preliminary synthetic anomaly injection framework reports a precision of 96% and a false positive rate of 0.08%, though recall is only 58%. By bridging the gap between simulated and real-world service behavior, the system offers practical utility for Prime Video and provides a foundation for broader application across microservice ecosystems.
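For reference, the three reported metrics follow the standard confusion-matrix definitions. The notation below is added here for illustration and is not taken from the original article: TP and FP are the correctly and incorrectly flagged items, FN the injected anomalies that were missed, and TN the normal items left unflagged.

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
\text{FPR} = \frac{FP}{FP + TN}
\]
```

Read this way, 96% precision means nearly every flag corresponded to an injected anomaly, while 58% recall means roughly two in five injected anomalies went unflagged.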
Key takeaway
For research scientists developing reliability tools for large-scale microservice architectures, this work demonstrates that unsupervised graph-based anomaly detection can effectively identify critical service deviations. You should consider implementing structural graph embeddings, like those from a GCN-GAE, to compare simulated load test traffic against real event data. This approach offers early detection and can validate changes, but be prepared to integrate contextual data and prioritize alerts to manage false positives and improve explainability.
Key insights
Unsupervised graph embeddings can detect microservice anomalies by comparing structural deviations between simulated and live traffic.
Principles
- Structural graph deviations suffice for anomaly detection.
- Contextual signals enhance anomaly interpretability.
- Prioritization reduces noise in anomaly alerts.
Method
The system uses a GCN-GAE to learn node embeddings from minute-level service graphs. Anomalies are detected by computing the cosine similarity between each service's gameday (load test) embedding and its live event embedding, flagging services whose similarity falls below a threshold of 0.98.
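The comparison step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the embeddings have already been produced by some encoder and are held in dictionaries keyed by service name, and the function name `flag_anomalous_services` is hypothetical. Only the cosine similarity measure and the 0.98 threshold come from the text above.

```python
import numpy as np

def flag_anomalous_services(gameday_emb, event_emb, threshold=0.98):
    """Return {service: similarity} for services whose gameday and event
    embeddings have cosine similarity below `threshold`.

    Both arguments are assumed to be dicts mapping a service name to a
    1-D embedding vector (hypothetical layout, chosen for illustration).
    """
    flagged = {}
    for service, g in gameday_emb.items():
        e = event_emb.get(service)
        if e is None:
            continue  # service absent from the event snapshot; handled elsewhere
        g = np.asarray(g, dtype=float)
        e = np.asarray(e, dtype=float)
        denom = np.linalg.norm(g) * np.linalg.norm(e)
        if denom == 0.0:
            continue  # cosine similarity is undefined for a zero vector
        similarity = float(np.dot(g, e) / denom)
        if similarity < threshold:
            flagged[service] = similarity
    return flagged

# Toy embeddings: "checkout" barely moves, "playback" changes direction entirely.
gameday = {"checkout": [1.0, 0.0], "playback": [1.0, 1.0]}
event = {"checkout": [1.0, 0.01], "playback": [1.0, -1.0]}
print(flag_anomalous_services(gameday, event))
```

In this toy input only `playback` is flagged, because its two vectors are orthogonal, while `checkout` stays well above the threshold.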
In practice
- Use GCN-GAE for dynamic graph anomaly detection.
- Normalize edge weights for cross-snapshot comparisons.
- Incorporate deployment metadata for anomaly validation.
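To illustrate the edge-weight normalization point in the list above, here is a minimal sketch. The summary does not say which normalization the authors use, so this example assumes the simplest option: dividing each edge weight by the snapshot's total weight, so that weights become traffic shares. The edge representation and the name `normalize_edge_weights` are hypothetical.

```python
def normalize_edge_weights(edges):
    """Scale one snapshot's edge weights so they sum to 1.

    `edges` is assumed to map a (caller, callee) pair to a raw,
    non-negative weight such as a request count (hypothetical layout).
    """
    total = sum(edges.values())
    if total <= 0:
        # Empty or all-zero snapshot: nothing to scale.
        return {edge: 0.0 for edge in edges}
    return {edge: weight / total for edge, weight in edges.items()}

# A load test and a live event with the same traffic mix at different volumes.
gameday_edges = {("gateway", "auth"): 30, ("auth", "catalog"): 10}
event_edges = {("gateway", "auth"): 300, ("auth", "catalog"): 100}

print(normalize_edge_weights(gameday_edges))
print(normalize_edge_weights(event_edges))
```

After normalization the two snapshots are identical, so a tenfold difference in raw volume alone would not register as a structural change. Other schemes, such as normalizing per source node, are equally plausible and may suit some graphs better.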
Topics
- Graph Embedding
- Anomaly Detection
- Microservice Architecture
- Graph Neural Networks
- Prime Video
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.