From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure) · Depth: Advanced, long

Summary

Prime Video has developed a graph-based anomaly detection system that identifies services under-represented in load tests, which often fail to reproduce the unique behaviors of real event traffic. The system is built on a Graph Convolutional Autoencoder (GCN-GAE) that learns structural representations from directed, weighted microservice graphs at minute-level resolution. It flags a service as anomalous when the cosine similarity between its load test embedding and its live event embedding is low. The approach has shown early detection capability, identifying incident-related services documented in Correction of Error (CoE) reports up to three minutes before alarms. A preliminary synthetic anomaly injection framework reports 96% precision and a 0.08% false positive rate, though recall is only 58%. The system is of practical use at Prime Video and, by bridging the gap between simulated and real-world service behavior, offers a foundation for broader application across microservice ecosystems.
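To make the three reported figures concrete, here is a minimal sketch of how precision, recall, and false positive rate follow from confusion counts. The function name `detection_metrics` and the counts in the usage lines are hypothetical, invented only for illustration; they are not the paper's data.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard detection metrics from confusion counts.

    tp: injected anomalies that were flagged
    fp: normal services that were flagged
    fn: injected anomalies that were missed
    tn: normal services that were not flagged
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
    }


# Hypothetical counts (NOT from the paper), chosen only to show that high
# precision and a low false positive rate can coexist with modest recall.
metrics = detection_metrics(tp=58, fp=2, fn=42, tn=2498)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With counts of this shape, almost everything that is flagged is a true anomaly, yet a large share of injected anomalies goes undetected, which is the same pattern as the reported 96% precision and 58% recall.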

Key takeaway

For research scientists developing reliability tools for large-scale microservice architectures, this work demonstrates that unsupervised graph-based anomaly detection can effectively identify critical service deviations. Consider using structural graph embeddings, such as those from a GCN-GAE, to compare simulated load test traffic against real event data. This approach offers early detection and can validate changes, but be prepared to integrate contextual data and prioritize alerts in order to manage false positives and improve explainability.

Key insights

Unsupervised graph embeddings can detect microservice anomalies by comparing structural deviations between simulated and live traffic.

Method

The system uses a GCN-GAE to learn node embeddings from minute-level service graphs. Anomalies are detected by computing the cosine similarity between each service's gameday (load test) embedding and its event embedding; services whose similarity falls below a 0.98 threshold are flagged.
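The comparison step can be sketched in a few lines. This is a minimal illustration, assuming the embeddings are already available as one vector per service. The function names, the service names, and the embedding values are hypothetical; only the cosine similarity measure and the 0.98 threshold come from the summary above.

```python
import math
from typing import Dict, List, Sequence

THRESHOLD = 0.98  # similarity threshold stated in the summary


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as maximally dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def flag_anomalies(
    gameday: Dict[str, Sequence[float]],
    event: Dict[str, Sequence[float]],
    threshold: float = THRESHOLD,
) -> List[str]:
    """Return services (present in both maps) whose similarity is below threshold."""
    flagged = []
    for service in sorted(gameday.keys() & event.keys()):
        if cosine_similarity(gameday[service], event[service]) < threshold:
            flagged.append(service)
    return flagged


# Hypothetical embeddings for two made-up services.
gameday = {"playback": [1.0, 0.0, 0.0], "catalog": [0.6, 0.8, 0.0]}
event = {"playback": [1.0, 0.01, 0.0], "catalog": [0.0, 1.0, 0.0]}

flagged = flag_anomalies(gameday, event)
print(flagged)
```

In this toy input, the `playback` vectors are nearly parallel and stay above the threshold, while the `catalog` vectors have a similarity of 0.8 and are flagged. How services seen in only one of the two graphs should be handled is not covered by the summary; this sketch simply skips them.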

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.