How Netflix Maps Thousands of Microservices in Real-Time
Summary
Netflix has detailed its internal Service Topology system, which builds and continuously updates a live dependency graph for thousands of microservices in real time. The system unifies three distinct data sources (eBPF network flow logs, IPC metrics from instrumented services, and aggregated distributed traces) into a single, queryable graph. It addresses the long-standing engineering challenge of understanding service interdependencies and blast radii during incident resolution. A three-stage aggregation pipeline resolves raw network flows, collapsing multi-hop paths into direct application-to-application connections. Processing runs on Apache Pekko Streams and Kafka, and graph storage is built on Netflix's distributed key-value system, exposed through a gRPC API with sub-second response times for multi-hop and filtered queries. Historical queries use time-window aggregation to correlate dependency changes with incidents without high storage costs.
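To make the "collapsing multi-hop paths" step concrete, here is a minimal sketch. It is not Netflix's three-stage pipeline, whose stages the summary does not describe. It assumes flows have already been resolved to (source, destination) name pairs and that intermediaries such as load balancers or proxies are known in advance. The function name `collapse_flows` and all service names are hypothetical.

```python
from collections import defaultdict


def collapse_flows(flows, intermediaries):
    """Collapse paths through intermediaries into direct app-to-app edges.

    flows: iterable of (source, destination) pairs (assumed already resolved
           from raw network addresses to names).
    intermediaries: set of node names that are infrastructure hops
           (hypothetical examples: load balancers, proxies).
    """
    adjacency = defaultdict(set)
    for src, dst in flows:
        adjacency[src].add(dst)

    direct_edges = set()
    for src in list(adjacency):
        if src in intermediaries:
            continue  # only applications can originate a logical edge
        stack = list(adjacency[src])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in intermediaries:
                # Keep walking until the path reaches an application.
                stack.extend(adjacency.get(node, ()))
            elif node != src:
                direct_edges.add((src, node))
    return direct_edges


flows = [
    ("checkout", "lb-1"),
    ("lb-1", "proxy-1"),
    ("proxy-1", "payments"),
    ("checkout", "cart"),
]
edges = collapse_flows(flows, {"lb-1", "proxy-1"})
print(sorted(edges))  # [('checkout', 'cart'), ('checkout', 'payments')]
```

One known limitation of this simplification: when a shared intermediary fronts several sources and several destinations, it links every source to every destination behind it. A production pipeline would need per-connection correlation to avoid those false edges.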
Key takeaway
For MLOps Engineers or DevOps teams managing complex microservice architectures, understanding real-time service dependencies is critical for incident response. You should consider integrating diverse observability data sources like eBPF, IPC metrics, and traces into a unified, queryable service topology graph. This approach provides a clear blast radius analysis and accelerates root cause identification, preventing misdiagnoses from incomplete data. Prioritize sub-second query performance and historical analysis capabilities for effective operational insights.
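A blast radius query over such a graph can be sketched as a bounded breadth-first search over reversed edges: starting from a failing service, walk "who calls this?" up to a hop limit. The edge format, the function name `blast_radius`, and the service names are assumptions for illustration, not the API of Netflix's system.

```python
from collections import defaultdict, deque


def blast_radius(edges, failed, max_hops):
    """Return {service: hops} for every upstream caller within max_hops.

    edges: iterable of (caller, callee) pairs.
    failed: name of the service assumed to be failing.
    """
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)

    hops = {failed: 0}
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        if hops[service] == max_hops:
            continue  # do not expand past the hop limit
        for caller in callers.get(service, ()):
            if caller not in hops:
                hops[caller] = hops[service] + 1
                queue.append(caller)

    del hops[failed]  # report only the affected callers
    return hops


edges = [
    ("web", "checkout"),
    ("mobile", "checkout"),
    ("checkout", "payments"),
    ("payments", "ledger"),
    ("search", "catalog"),
]
print(blast_radius(edges, "ledger", max_hops=2))
# {'payments': 1, 'checkout': 2}
```

Raising `max_hops` to 3 in this example would also pull in `web` and `mobile`, which is the kind of filtered, multi-hop question the summary says the system answers.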
Key insights
Unifying diverse data sources into a real-time, queryable service topology graph enhances incident resolution.
Principles
- Merge diverse data sources for comprehensive views.
- An incomplete dependency map can mislead responders more than having no map at all.
- Real-time maps are crucial for dynamic environments.
Method
Ingest eBPF flow logs, IPC metrics, and traces; collapse multi-hop flows into direct application-to-application edges; store the resulting graph on a distributed key-value store and serve fast, multi-hop queries over it.
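The merge step can be pictured as a union of edges that records which sources observed each one. This is a sketch under assumptions: each source is taken to already emit (caller, callee) pairs, and the function name `merge_sources`, the source labels, and the service names are hypothetical.

```python
from collections import defaultdict


def merge_sources(**edges_by_source):
    """Union edges from several sources, keeping per-edge provenance.

    Each keyword argument is a source label mapped to an iterable of
    (caller, callee) pairs. Returns {edge: set of source labels}.
    """
    observed = defaultdict(set)
    for source, edges in edges_by_source.items():
        for edge in edges:
            observed[edge].add(source)
    return dict(observed)


graph = merge_sources(
    ebpf=[("checkout", "payments"), ("checkout", "legacy-billing")],
    ipc=[("checkout", "payments")],
    traces=[("checkout", "payments"), ("payments", "ledger")],
)

# An edge seen only in network flows suggests an uninstrumented dependency.
print(graph[("checkout", "legacy-billing")])  # {'ebpf'}
```

Keeping provenance on each edge lets a query distinguish dependencies confirmed by several sources from those visible only at the network level.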
In practice
- Use eBPF for uninstrumented service visibility.
- Implement time-window aggregation for historical views.
- Design for sub-second query response times.
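Time-window aggregation can be sketched as bucketing edge observations into fixed windows and storing one edge set per window, so storage grows with the number of distinct edges per window rather than with raw observation volume. The five-minute window, the tuple layout, and the function names below are illustrative assumptions, not details from the original article.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # assumed window size for illustration


def bucket_edges(observations, window=WINDOW_SECONDS):
    """Group (timestamp, caller, callee) observations by window start time."""
    windows = defaultdict(set)
    for timestamp, caller, callee in observations:
        window_start = timestamp - timestamp % window
        windows[window_start].add((caller, callee))
    return dict(windows)


def diff_windows(windows, before, after):
    """Report edges that appeared or disappeared between two windows."""
    old = windows.get(before, set())
    new = windows.get(after, set())
    return {"added": new - old, "removed": old - new}


observations = [
    (1000, "checkout", "payments"),
    (1010, "checkout", "payments"),  # duplicate within the same window
    (1100, "checkout", "cart"),
    (1250, "checkout", "payments"),
    (1290, "checkout", "fraud-check"),
]
windows = bucket_edges(observations)
change = diff_windows(windows, before=900, after=1200)
print(change["added"])    # {('checkout', 'fraud-check')}
print(change["removed"])  # {('checkout', 'cart')}
```

Diffing the window just before an incident against the window during it surfaces dependency changes that may be worth investigating first.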
Topics
- Microservices
- Service Topology
- Real-time Observability
- eBPF
- Graph Databases
- Incident Management
Best for: CTO, VP of Engineering/Data, DevOps Engineer, Software Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.