How Netflix Maps Thousands of Microservices in Real-Time

Source: InfoQ · Field: Technology & Digital (Software Development & Engineering, Cloud Computing & IT Infrastructure, Data Science & Analytics) · Depth: Advanced, short

Summary

Netflix has detailed its internal Service Topology system, which builds and continuously updates a live dependency graph for thousands of microservices. The system unifies three distinct data sources (eBPF network flow logs, IPC metrics from instrumented services, and aggregated distributed traces) into a single, queryable graph. It addresses a long-standing engineering problem: understanding service interdependencies and blast radii during incident resolution. A three-stage aggregation pipeline resolves raw network flows and collapses multi-hop paths into direct application-to-application connections. Processing runs on Apache Pekko Streams and Kafka, and graph storage is built on Netflix's distributed key-value system, exposed through a gRPC API that answers multi-hop and filtered queries in under a second. Historical queries rely on time-window aggregation, which lets engineers correlate dependency changes with incidents without incurring high storage costs.
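To make the idea of collapsing multi-hop paths concrete, here is a minimal sketch in Python. The summary does not spell out what each of the three stages does, so the split used here (resolve addresses to owners, link owners, collapse intermediaries) is an assumption, and every name (`collapse_flows`, `owners`, `intermediaries`, the sample services) is hypothetical rather than taken from Netflix's system.

```python
from collections import defaultdict


def collapse_flows(flows, owners, intermediaries):
    """Turn raw (src_ip, dst_ip) flows into direct application edges.

    flows:          iterable of (src_ip, dst_ip) pairs, e.g. from flow logs
    owners:         dict mapping an IP to the name of the workload that owns it
    intermediaries: set of owner names treated as pass-through hops
                    (proxies, load balancers); an assumption of this sketch
    """
    # Step 1 (assumed): resolve IPs to owner names, dropping unknown addresses.
    adjacency = defaultdict(set)
    for src_ip, dst_ip in flows:
        src, dst = owners.get(src_ip), owners.get(dst_ip)
        if src is None or dst is None or src == dst:
            continue
        # Step 2 (assumed): link owners that were seen talking to each other.
        adjacency[src].add(dst)

    # Step 3 (assumed): walk through intermediaries so that
    # app -> proxy -> app becomes a single app -> app edge.
    # Note: this over-approximates when several apps share one proxy,
    # because the flows alone do not say which caller reached which target.
    edges = set()
    for src in list(adjacency):
        if src in intermediaries:
            continue
        stack, seen = list(adjacency[src]), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node in intermediaries:
                stack.extend(adjacency.get(node, ()))
            elif node != src:
                edges.add((src, node))
    return edges


sample_flows = [
    ("10.0.0.1", "10.0.0.2"),  # api -> edge-proxy
    ("10.0.0.2", "10.0.0.3"),  # edge-proxy -> playback
    ("10.0.0.1", "10.0.0.4"),  # api -> billing
    ("10.0.0.9", "10.0.0.3"),  # unknown source, ignored
]
sample_owners = {
    "10.0.0.1": "api",
    "10.0.0.2": "edge-proxy",
    "10.0.0.3": "playback",
    "10.0.0.4": "billing",
}
direct_edges = collapse_flows(sample_flows, sample_owners, {"edge-proxy"})
```

In this toy input, the proxy hop disappears and `api` ends up with direct edges to `playback` and `billing`. A production pipeline would likely need extra signals (for example IPC metrics or traces) to disambiguate callers that share an intermediary.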

Key takeaway

For MLOps engineers and DevOps teams managing complex microservice architectures, an up-to-date view of service dependencies is critical for incident response. Consider integrating diverse observability data sources such as eBPF flow logs, IPC metrics, and traces into a unified, queryable service topology graph. This makes blast radius analysis explicit and speeds up root cause identification, because no single data source sees every dependency and a partial view can lead to misdiagnosis. Prioritize sub-second query performance and historical analysis so the graph remains useful during an active incident and in later review.
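As an illustration of what a blast radius query over such a graph could look like, the sketch below runs a breadth-first search over reversed dependency edges to find every service that directly or transitively calls a failed one. It is a generic graph traversal written under the assumption that edges are stored as `(caller, callee)` pairs; the function name `blast_radius`, the `max_hops` parameter, and the sample services are hypothetical and do not describe Netflix's gRPC API.

```python
from collections import defaultdict, deque


def blast_radius(edges, failed, max_hops=None):
    """Return {service: hop_count} for every upstream caller of `failed`.

    edges:    iterable of (caller, callee) dependency pairs
    failed:   name of the service assumed to be degraded
    max_hops: optional limit on how far upstream to walk
    """
    # Index callers by callee so the search can walk "who depends on me".
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)

    hops = {failed: 0}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        if max_hops is not None and hops[node] >= max_hops:
            continue
        for caller in callers.get(node, ()):
            if caller not in hops:  # first visit is the shortest path in BFS
                hops[caller] = hops[node] + 1
                queue.append(caller)

    del hops[failed]  # the failed service is not part of its own blast radius
    return hops


sample_edges = [
    ("web", "api"),
    ("api", "db"),
    ("batch", "db"),
    ("api", "cache"),
]
impacted = blast_radius(sample_edges, "db")
impacted_one_hop = blast_radius(sample_edges, "db", max_hops=1)
```

Here a failure in `db` reaches `api` and `batch` in one hop and `web` in two, while `cache` is unaffected because it does not call `db`. Limiting `max_hops` mirrors the kind of filtered multi-hop query the summary mentions.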

Key insights

Unifying diverse data sources into a real-time, queryable service topology graph enhances incident resolution.

Principles

Method

Ingest eBPF flow logs, IPC metrics, and traces; collapse multi-hop flows into direct application-to-application edges; store the resulting graph in a distributed key-value system that serves fast multi-hop queries.
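The method above, combined with the time-window aggregation mentioned in the summary, can be sketched as follows. Observations from the three sources are merged into one record per edge per fixed window, so storage grows with the number of distinct edges rather than with the number of raw events, and two windows can be compared to spot dependency changes. The five-minute window, the tuple layout, and the names `aggregate` and `edges_added` are assumptions made for illustration, not details from the source.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # assumed window size; the source does not give one


def window_start(timestamp, size=WINDOW_SECONDS):
    """Round a Unix timestamp down to the start of its window."""
    return timestamp - (timestamp % size)


def aggregate(observations, size=WINDOW_SECONDS):
    """Merge per-source observations into one entry per edge per window.

    observations: iterable of (timestamp, source, caller, callee), where
                  source is a label such as "ebpf", "ipc", or "trace"
    Returns {window_start: {(caller, callee): set_of_sources}}.
    """
    windows = defaultdict(lambda: defaultdict(set))
    for timestamp, source, caller, callee in observations:
        windows[window_start(timestamp, size)][(caller, callee)].add(source)
    return {start: dict(edges) for start, edges in windows.items()}


def edges_added(windows, before, after):
    """Edges present in the `after` window but absent from `before`."""
    return set(windows.get(after, {})) - set(windows.get(before, {}))


sample_observations = [
    (10, "ebpf", "api", "db"),
    (20, "trace", "api", "db"),
    (310, "ipc", "api", "db"),
    (320, "ebpf", "api", "cache"),
]
by_window = aggregate(sample_observations)
new_edges = edges_added(by_window, before=0, after=300)
```

Keeping the set of sources per edge is one way to record provenance: an edge seen only in flow logs might deserve less trust than one confirmed by traces as well. In the sample data, the `api` to `cache` dependency first appears in the second window, which is the kind of change an engineer might line up against an incident timeline.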

Topics

Best for: CTO, VP of Engineering/Data, DevOps Engineer, Software Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.