How Datadog Redefined Data Replication

2025-12-15 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Intermediate, medium

Summary

Datadog addressed a critical performance issue on its Metrics Summary page, where p90 latency reached 7 seconds due to expensive database joins involving 82,000 active metrics and 817,000 configurations. Initial attempts at query optimization and indexing failed because the Postgres database, designed for OLTP, was being misused for real-time search workloads. Datadog implemented a Change Data Capture (CDC) pipeline using Debezium to read Postgres's Write-Ahead Log (WAL), stream changes to Kafka, and then push denormalized data to a dedicated search platform. They opted for asynchronous replication to prioritize speed and resilience over strong consistency, accepting minor replication lag. To manage schema evolution, Datadog developed an automated validation system for SQL migrations and a multi-tenant Kafka Schema Registry enforcing backward compatibility with Avro serialization. This solution, initially a fix for one page, evolved into a company-wide data replication platform orchestrated by Temporal, supporting various data sources and destinations.
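The replication step described above can be illustrated with a minimal sketch. It assumes a simplified Debezium-style change envelope (an `op` code plus `before` and `after` row images) and uses an in-memory dict as a stand-in for the search platform. The row fields (`id`, `name`, `tags`) and the function name are hypothetical and do not come from the original article.

```python
from typing import Any


def apply_change_event(index: dict[int, dict[str, Any]], event: dict[str, Any]) -> None:
    """Apply one simplified Debezium-style change event to an in-memory index.

    Op codes follow the Debezium envelope: "c" (create), "r" (snapshot read),
    "u" (update), "d" (delete). The dict stands in for a real search platform.
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]
        index[row["id"]] = dict(row)  # upsert the latest row image
    elif op == "d":
        index.pop(event["before"]["id"], None)  # tolerate repeated deletes
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical change stream, in the order it would be read from the topic.
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "name": "cpu.user", "tags": ["host"]}},
    {"op": "u", "before": {"id": 1, "name": "cpu.user", "tags": ["host"]},
     "after": {"id": 1, "name": "cpu.user", "tags": ["host", "env"]}},
    {"op": "c", "before": None,
     "after": {"id": 2, "name": "mem.free", "tags": []}},
    {"op": "d", "before": {"id": 2, "name": "mem.free", "tags": []},
     "after": None},
]

search_index: dict[int, dict[str, Any]] = {}
for change in events:
    apply_change_event(search_index, change)

print(search_index)
```

Because upserts and deletes are keyed on the row id, replaying the same events leaves the index unchanged, which suits an asynchronous pipeline where messages may be redelivered.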

Key takeaway

CTOs and VPs of Engineering facing database performance bottlenecks from mixed OLTP and search workloads should consider a Change Data Capture (CDC) strategy. This approach offloads search queries to a specialized platform, which reduces load on the primary database and improves latency. Decide how much replication lag your use cases can tolerate: asynchronous replication offers greater resilience and speed, but it demands robust schema evolution management, and scaling it across many data flows requires automation.

Key insights

Separate OLTP from real-time search workloads using CDC for scalable data replication.

Principles

Databases excel at specific workloads.
Asynchronous replication enhances resilience.
Automate complex infrastructure provisioning.

Method

Implement CDC via Debezium and Kafka to replicate relational data to a search platform, denormalizing data and managing schema evolution with automated validation and a Schema Registry.
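To make the schema evolution part concrete, here is a rough sketch of a backward-compatibility check in the spirit of what a Schema Registry enforces for Avro: a new schema must still be able to read data written with the old one, so any added field needs a default. This is a deliberately simplified illustration, not the registry's actual algorithm. It ignores Avro type promotions, aliases, and nested records, and the schemas and function name are hypothetical.

```python
from typing import Any


def backward_violations(old_schema: dict[str, Any], new_schema: dict[str, Any]) -> list[str]:
    """Return reasons why new_schema could not read data written with old_schema.

    Simplified rules (assumption, stricter than real Avro resolution):
      - a field added in new_schema must declare a default
      - a field present in both schemas must keep the same type
      - removing a field is allowed
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            if "default" not in field:
                problems.append(f"added field {field['name']!r} has no default")
        elif previous["type"] != field["type"]:
            problems.append(f"field {field['name']!r} changed type")
    return problems


# Hypothetical Avro record schemas for a metric configuration row.
v1 = {"type": "record", "name": "MetricConfig", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
]}
v2_ok = {"type": "record", "name": "MetricConfig", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "unit", "type": ["null", "string"], "default": None},
]}
v2_bad = {"type": "record", "name": "MetricConfig", "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "owner", "type": "string"},
]}

print(backward_violations(v1, v2_ok))
print(backward_violations(v1, v2_bad))
```

A check like this could run in CI against proposed migrations, so that an incompatible change is rejected before it reaches the pipeline.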

In practice

Use Debezium for Postgres WAL capture.
Employ Kafka as a durable message broker.
Automate pipeline setup with workflow engines.
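As an illustration of the first point, the sketch below builds the JSON body that would typically be submitted to Kafka Connect to register a Debezium Postgres connector. The property names follow the Debezium 2.x Postgres connector; the connector name, host, database, table, slot, and publication values are all hypothetical placeholders, and nothing here reflects Datadog's actual configuration.

```python
import json

# Hypothetical connector definition; every value below is a placeholder.
connector = {
    "name": "metrics-postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # Postgres built-in logical decoding plugin
        "database.hostname": "postgres.example.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "<from-secret-store>",
        "database.dbname": "metrics",
        "topic.prefix": "metrics",            # prefix for the Kafka topics that receive changes
        "table.include.list": "public.metric_configurations",
        "slot.name": "metrics_cdc_slot",      # replication slot that tracks the WAL position
        "publication.name": "metrics_cdc_pub",
    },
}

# Serialized body, suitable for a POST to the Kafka Connect /connectors endpoint.
payload = json.dumps(connector, indent=2)
print(payload)
```

Generating this definition from a template is one way a workflow engine could provision a new pipeline without manual steps.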

Topics

Datadog
Data Replication
Change Data Capture
Asynchronous Replication
Schema Evolution

Best for: CTO, VP of Engineering/Data, Data Engineer, DevOps Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.