How LinkedIn Built a Next-Gen Service Discovery for 1000s of Services
Summary
LinkedIn has migrated its core service discovery infrastructure from a decade-old Apache ZooKeeper-based system to a new "Next-Gen Service Discovery" architecture. The previous system used ZooKeeper as a centralized registry for D2-formatted endpoint addresses. It faced critical scalability, compatibility, and extensibility issues, and was projected to reach capacity by early 2025 because of read storms and the limits of strong consistency. The new architecture separates the read and write paths: writes go through Kafka, and reads are served by a custom-built, Go-based Service Discovery Observer that pushes updates to clients over gRPC streams using the xDS protocol. The migration produced a tenfold improvement in median data propagation latency (P50 < 1 second) and a sixfold improvement in 99th percentile latency (P99 < 5 seconds). It also improved reliability and enabled modern service mesh features and cross-fabric capabilities.
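To make the P50 and P99 figures concrete, here is a minimal sketch of how propagation latency percentiles could be computed and checked against those targets. The sample latencies and the `percentile` helper are illustrative assumptions, not LinkedIn's data or tooling; the nearest-rank method is just one common way to define a percentile.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical propagation latencies in seconds: the time between a server
# announcing an endpoint change and a client observing it.
latencies = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.5, 4.0]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

# Targets quoted in the summary: P50 < 1 second, P99 < 5 seconds.
meets_targets = p50 < 1.0 and p99 < 5.0
```

With only ten samples, the P99 is simply the slowest observation; a production dashboard would compute it over far more data points.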
Key takeaway
DevOps engineers managing large-scale microservice architectures should consider a decoupled service discovery model that prioritizes eventual consistency and builds on modern streaming platforms like Kafka and protocols like xDS. Teams should also evaluate a dual-mode migration strategy to transition critical infrastructure safely, using comprehensive metrics and automated dependency analysis to prevent outages and accelerate adoption across diverse application stacks.
Key insights
Separating read/write paths and adopting eventual consistency significantly improves service discovery scalability and reliability.
Principles
- Prioritize availability over strong consistency for service discovery.
- Decouple read and write operations for distributed systems.
- Use industry-standard protocols for broader compatibility.
Method
LinkedIn's Next-Gen Service Discovery uses Kafka for server writes and heartbeats. Clients read from a Go-based Observer over gRPC streams using the xDS protocol, and the Observer pushes updates to clients instead of having them poll.
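The split between a write log and a push-based read path can be sketched in a few lines. This is a toy, in-process model under stated assumptions: a plain Python list stands in for the Kafka topic, callbacks stand in for gRPC streams, and the `Observer` class, its method names, and the event shape are all hypothetical. It is not LinkedIn's Go implementation and does not speak xDS.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical event shape: (service name, endpoint address, is_up).
Event = Tuple[str, str, bool]
Callback = Callable[[List[str]], None]


class Observer:
    """Toy read path: applies write events to an in-memory cache and
    pushes a full endpoint snapshot to subscribers on every change."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, Set[str]] = defaultdict(set)
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, service: str, callback: Callback) -> None:
        # A new subscriber immediately receives the current snapshot.
        self._subscribers[service].append(callback)
        callback(sorted(self._endpoints[service]))

    def apply(self, service: str, endpoint: str, is_up: bool) -> None:
        current = self._endpoints[service]
        before = set(current)
        if is_up:
            current.add(endpoint)
        else:
            current.discard(endpoint)
        # Push only when the endpoint set actually changed.
        if current != before:
            snapshot = sorted(current)
            for callback in self._subscribers[service]:
                callback(snapshot)


# Stand-in for the write path: servers append announcements to a log.
write_log: List[Event] = [
    ("profile", "10.0.0.1:8080", True),
    ("profile", "10.0.0.2:8080", True),
    ("profile", "10.0.0.1:8080", False),
]

received: List[List[str]] = []
observer = Observer()
observer.subscribe("profile", received.append)
for event in write_log:
    observer.apply(*event)
```

The client never asks "what changed?". It registers once and then receives each new snapshot as the Observer consumes the log, which is the property that removes polling load from the registry.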
In practice
- Implement dual-mode migration to verify new systems in production.
- Use automated dependency analysis to identify migration blockers.
- Monitor end-to-end propagation latency for critical system updates.
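One way to read the dual-mode practice above is to query both the legacy and the new system on every lookup, keep serving from the legacy one, and record disagreements as a metric until the new system has earned trust. The sketch below assumes that interpretation; `DualModeResolver`, its fields, and the sample data are hypothetical names for illustration, not part of LinkedIn's tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List

Lookup = Callable[[str], FrozenSet[str]]


@dataclass
class DualModeResolver:
    """Queries both discovery systems, serves from one of them, and
    records every service where the two answers disagree."""

    legacy: Lookup
    next_gen: Lookup
    serve_from_next_gen: bool = False  # flip once mismatches stay at zero
    lookups: int = 0
    mismatches: List[str] = field(default_factory=list)

    def resolve(self, service: str) -> FrozenSet[str]:
        old = self.legacy(service)
        new = self.next_gen(service)
        self.lookups += 1
        if old != new:
            self.mismatches.append(service)
        return new if self.serve_from_next_gen else old

    def mismatch_rate(self) -> float:
        return len(self.mismatches) / self.lookups if self.lookups else 0.0


# Hypothetical data: "feed" has not yet propagated to the new system.
legacy_data: Dict[str, FrozenSet[str]] = {
    "profile": frozenset({"10.0.0.1:8080", "10.0.0.2:8080"}),
    "feed": frozenset({"10.0.1.1:8080"}),
}
next_gen_data: Dict[str, FrozenSet[str]] = {
    "profile": frozenset({"10.0.0.1:8080", "10.0.0.2:8080"}),
    "feed": frozenset(),
}

resolver = DualModeResolver(
    legacy=lambda s: legacy_data.get(s, frozenset()),
    next_gen=lambda s: next_gen_data.get(s, frozenset()),
)
profile_endpoints = resolver.resolve("profile")
feed_endpoints = resolver.resolve("feed")
```

Because traffic is still served from the legacy answer, a mismatch shows up as a metric rather than as an outage, and the list of mismatching services points at what to fix before cutting over.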
Topics
- Service Discovery
- Microservices Architecture
- Distributed Systems Migration
- Kafka
- xDS Protocol
Best for: Software Engineer, DevOps Engineer, AI Operations Specialist
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.