How Datadog Redefined Data Replication
Summary
Datadog addressed a critical performance issue on its Metrics Summary page, where p90 latency reached 7 seconds due to expensive database joins involving 82,000 active metrics and 817,000 configurations. Initial attempts at query optimization and indexing failed because the Postgres database, designed for OLTP, was being misused for real-time search workloads. Datadog implemented a Change Data Capture (CDC) pipeline using Debezium to read Postgres's Write-Ahead Log (WAL), stream changes to Kafka, and then push denormalized data to a dedicated search platform. They opted for asynchronous replication to prioritize speed and resilience over strong consistency, accepting minor replication lag. To manage schema evolution, Datadog developed an automated validation system for SQL migrations and a multi-tenant Kafka Schema Registry enforcing backward compatibility with Avro serialization. This solution, initially a fix for one page, evolved into a company-wide data replication platform orchestrated by Temporal, supporting various data sources and destinations.
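The core step of the pipeline, turning row-level change events into denormalized search documents, can be sketched roughly as follows. This is an illustration under stated assumptions, not Datadog's implementation. The event fields (`op`, `before`, `after`, `source.table`) follow Debezium's documented change-event envelope. The table names `metrics` and `metric_configurations`, the column names, and the in-memory `SearchIndex` standing in for the search platform are hypothetical.

```python
from typing import Any, Dict

Event = Dict[str, Any]


class SearchIndex:
    """In-memory stand-in for a search platform (hypothetical)."""

    def __init__(self) -> None:
        # metric_id -> one denormalized document with its configurations embedded
        self.docs: Dict[int, Dict[str, Any]] = {}

    def _doc(self, metric_id: int) -> Dict[str, Any]:
        # Create a placeholder if a configuration arrives before its metric;
        # events for different tables may be consumed out of order.
        return self.docs.setdefault(
            metric_id, {"metric_id": metric_id, "name": None, "configs": {}}
        )

    def apply(self, event: Event) -> None:
        """Apply one unwrapped Debezium event (tombstones already filtered out)."""
        op = event["op"]  # c=create, u=update, d=delete, r=snapshot read
        table = event["source"]["table"]
        # Deletes carry the old row in "before". This sketch assumes the
        # columns it needs are present there (e.g. REPLICA IDENTITY FULL).
        row = event["before"] if op == "d" else event["after"]

        if table == "metrics":
            if op == "d":
                self.docs.pop(row["id"], None)
            else:
                self._doc(row["id"])["name"] = row["name"]
        elif table == "metric_configurations":
            if op == "d":
                doc = self.docs.get(row["metric_id"])
                if doc is not None:
                    doc["configs"].pop(row["id"], None)
            else:
                self._doc(row["metric_id"])["configs"][row["id"]] = row["setting"]
```

Because each metric's configurations are embedded in its document, the search side can answer a page query without the join that was slowing the relational database.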
Key takeaway
CTOs and VPs of Engineering facing database performance bottlenecks from mixed OLTP and search workloads should consider a Change Data Capture (CDC) strategy. This approach offloads search queries to specialized platforms, reducing load on the primary database and improving latency. Decide how much replication lag each use case can tolerate: asynchronous replication offers greater resilience and speed, but it requires robust schema evolution management, and scaling it across many data flows requires automation.
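One way to make "acceptable replication lag" concrete is to measure it per event on the consumer side. The sketch below assumes events in Debezium's envelope, where `source.ts_ms` records when the change was made in the database. The function names and the budget value are hypothetical, and the result is only as accurate as clock synchronization between the database host and the consumer.

```python
import time
from typing import Any, Dict, Optional


def replication_lag_ms(event: Dict[str, Any], now_ms: Optional[int] = None) -> int:
    """Milliseconds between the database change and 'now' on the consumer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero so small clock skew never yields a negative lag.
    return max(0, now_ms - event["source"]["ts_ms"])


def within_budget(event: Dict[str, Any], budget_ms: int,
                  now_ms: Optional[int] = None) -> bool:
    """True if the event reached the consumer within the lag budget."""
    return replication_lag_ms(event, now_ms) <= budget_ms
```

Tracking a high percentile of this value over time, rather than single events, gives a number that can be compared against what each page or feature can tolerate.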
Key insights
Separate OLTP from real-time search workloads using CDC for scalable data replication.
Principles
- Databases excel at specific workloads.
- Asynchronous replication enhances resilience.
- Automate complex infrastructure provisioning.
Method
Implement CDC via Debezium and Kafka to replicate relational data to a search platform, denormalizing data and managing schema evolution with automated validation and a Schema Registry.
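The backward-compatibility rule mentioned above can be illustrated with a simplified check on Avro record schemas represented as Python dictionaries. This is a sketch, not the Schema Registry's actual algorithm: it covers only top-level fields, and it treats any type change as incompatible, which is stricter than Avro's resolution rules (Avro permits some promotions, such as `int` to `long`). The function name is hypothetical.

```python
from typing import Any, Dict, List


def backward_compat_errors(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Return reasons why `new` could not read records written with `old`.

    An empty list means the change passes this simplified check.
    """
    old_fields = {f["name"]: f for f in old["fields"]}
    errors: List[str] = []
    for field in new["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default.
            if "default" not in field:
                errors.append(f"added field '{name}' has no default")
        elif field["type"] != old_fields[name]["type"]:
            errors.append(f"field '{name}' changed type")
    # Fields removed in `new` are fine: the reader ignores extra data.
    return errors
```

Under this rule, adding an optional column with a default or dropping a column passes, while adding a required column without a default is rejected before it can break downstream consumers.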
In practice
- Use Debezium for Postgres WAL capture.
- Employ Kafka as a durable message broker.
- Automate pipeline setup with workflow engines.
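As a rough illustration of automating pipeline setup, the sketch below generates the JSON payload that Kafka Connect's REST API accepts when creating a Debezium Postgres source connector. The property names are those of the Debezium Postgres connector as of its 2.x releases (older versions use `database.server.name` instead of `topic.prefix`). The `PipelineSpec` type, host names, and secrets file path are hypothetical, and the code only builds the payload; it does not call any service.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class PipelineSpec:
    """Hypothetical description of one replication pipeline."""
    name: str
    db_host: str
    db_name: str
    tables: List[str]  # schema-qualified, e.g. "public.metrics"


def connector_payload(spec: PipelineSpec) -> Dict[str, Any]:
    """Build the body for a POST to Kafka Connect's /connectors endpoint."""
    # Postgres replication slot names allow only lowercase letters,
    # digits, and underscores.
    slot = spec.name.replace("-", "_").lower()
    return {
        "name": f"{spec.name}-source",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",  # logical decoding plugin built into Postgres
            "database.hostname": spec.db_host,
            "database.port": "5432",
            "database.dbname": spec.db_name,
            # Resolved by Kafka Connect's FileConfigProvider, if enabled on
            # the worker; the file path here is an assumption.
            "database.user": "${file:/secrets/pg.properties:user}",
            "database.password": "${file:/secrets/pg.properties:password}",
            "topic.prefix": spec.name,
            "table.include.list": ",".join(spec.tables),
            "slot.name": slot,
        },
    }


def to_json(spec: PipelineSpec) -> str:
    """Serialize the payload for submission by a workflow step."""
    return json.dumps(connector_payload(spec), indent=2)
```

A workflow engine can then treat "render payload", "create connector", and "verify connector is running" as separate retryable steps, which is what makes provisioning many pipelines manageable.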
Topics
- Datadog
- Data Replication
- Change Data Capture
- Asynchronous Replication
- Schema Evolution
Best for: CTO, VP of Engineering/Data, Data Engineer, DevOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.