Pinterest’s CDC-Powered Ingestion Slashes Database Latency from 24 Hours to 15 Minutes
Summary
Pinterest has replaced its legacy batch-based database ingestion system, which left data more than 24 hours stale and used resources inefficiently, with a new ingestion framework. The new architecture is built on Change Data Capture (Debezium/TiCDC), Kafka, Flink, Spark, and Iceberg, and makes online database changes available within minutes, typically 15. The unified, configuration-driven framework supports MySQL, TiDB, and KVStore. It processes only changed records (around 5% daily) instead of full tables, which significantly reduces infrastructure costs. Changes land in append-only CDC tables, and base tables are updated from them via Spark Merge Into operations. Pinterest standardized on Iceberg's Merge on Read strategy to manage petabyte-scale data efficiently on AWS S3, while also addressing small-file problems and enabling incremental updates and deletions.
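The split between an append-only CDC table and a base table can be illustrated with a minimal in-memory sketch. This is not Pinterest's implementation: the `ChangeEvent` type, its fields, and the `merge_into` helper are hypothetical, and the `c`/`u`/`d` operation codes are assumed to follow Debezium-style conventions. It shows only the general idea of keeping the latest change per key and applying it as an upsert or delete.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row of a hypothetical append-only CDC table."""
    key: int
    op: str               # assumed Debezium-style codes: "c" create, "u" update, "d" delete
    ts: int               # change timestamp; larger means more recent
    row: Optional[dict]   # full row image after the change, None for deletes


def latest_per_key(cdc_log: Iterable[ChangeEvent]) -> Dict[int, ChangeEvent]:
    """Keep only the most recent change for each primary key."""
    latest: Dict[int, ChangeEvent] = {}
    for event in cdc_log:
        current = latest.get(event.key)
        # On equal timestamps the later log entry wins.
        if current is None or event.ts >= current.ts:
            latest[event.key] = event
    return latest


def merge_into(base: Dict[int, dict], cdc_log: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Return a new base table with the deduplicated changes applied."""
    merged = dict(base)  # the input base table is left untouched
    for key, event in latest_per_key(cdc_log).items():
        if event.op == "d":
            merged.pop(key, None)    # delete if present
        else:
            merged[key] = event.row  # insert or update
    return merged


base_table = {1: {"name": "a"}, 2: {"name": "b"}}
cdc_table = [
    ChangeEvent(1, "u", 10, {"name": "a2"}),
    ChangeEvent(2, "d", 11, None),
    ChangeEvent(3, "c", 12, {"name": "c"}),
    ChangeEvent(3, "u", 13, {"name": "c2"}),
]
new_base = merge_into(base_table, cdc_table)
```

Only the keys that appear in the CDC log are touched, which is the property that lets the real system avoid rewriting full tables.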
Key takeaway
For AI Architects and CTOs evaluating data ingestion strategies, Pinterest's CDC-powered ingestion demonstrates a viable path to cutting data latency from over 24 hours to 15 minutes. Teams should consider a similar framework built on Debezium/TiCDC, Kafka, Flink, Spark, and Iceberg, standardizing on Iceberg's Merge on Read in particular, to reduce infrastructure costs and make fresh data available sooner for critical ML and analytics workloads.
Key insights
CDC-powered ingestion cuts data latency from over 24 hours to about 15 minutes and lowers infrastructure costs by processing only the records that changed (around 5% daily) instead of full tables.
Principles
- Separate CDC from base tables
- Standardize on Merge on Read
- Partition tables by primary key hash
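The third principle, partitioning by primary key hash, can be sketched as follows. The bucket count, function names, and the use of CRC32 are assumptions made for illustration only; Iceberg's own bucket transform uses a 32-bit Murmur3 hash, which is not reproduced here. The point is that a stable hash sends every change for a given key to the same bucket, so buckets can be upserted independently.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_BUCKETS = 8  # hypothetical bucket count, chosen only for this example


def bucket_for(primary_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a primary key to a bucket with a stable hash.

    CRC32 is a stand-in: unlike Python's built-in hash(), it returns the
    same value across processes and runs.
    """
    return zlib.crc32(primary_key.encode("utf-8")) % num_buckets


def group_by_bucket(keys: Iterable[str], num_buckets: int = NUM_BUCKETS) -> Dict[int, List[str]]:
    """Group keys by bucket so each group can be merged on its own."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        groups[bucket_for(key, num_buckets)].append(key)
    return dict(groups)


changed_keys = [f"user-{i}" for i in range(100)]
groups = group_by_bucket(changed_keys)
```

Because a key never moves between buckets, a merge only has to touch the buckets that received changes.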
Method
Use Debezium/TiCDC for change capture, Kafka for streaming, Flink for real-time processing, Spark for batch updates of base tables, and Iceberg as the table format, standardizing on Merge on Read for cost-effective updates.
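As a rough sketch of what the Spark and Iceberg end of this method might look like, the helpers below build two Spark SQL statements: one that switches an Iceberg table to Merge on Read through the `write.delete.mode`, `write.update.mode`, and `write.merge.mode` table properties, and one `MERGE INTO` that applies deduplicated changes. The table names, column names, and the `op` column are hypothetical, and the article does not show Pinterest's actual statements. Identifiers are interpolated without escaping, so this is for illustration only.

```python
from typing import Sequence


def merge_on_read_properties_sql(table: str) -> str:
    """Build an ALTER TABLE statement that selects Merge on Read for row-level writes."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES (\n"
        "  'write.delete.mode' = 'merge-on-read',\n"
        "  'write.update.mode' = 'merge-on-read',\n"
        "  'write.merge.mode' = 'merge-on-read'\n"
        ")"
    )


def merge_into_sql(
    base_table: str,
    changes_view: str,
    key_column: str,
    columns: Sequence[str],
    op_column: str = "op",  # hypothetical column holding the change type
) -> str:
    """Build a MERGE INTO statement that upserts and deletes rows by primary key."""
    assignments = ", ".join(f"t.{c} = s.{c}" for c in columns)
    insert_columns = ", ".join(columns)
    insert_values = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {base_table} AS t\n"
        f"USING {changes_view} AS s\n"
        f"ON t.{key_column} = s.{key_column}\n"
        f"WHEN MATCHED AND s.{op_column} = 'd' THEN DELETE\n"
        f"WHEN MATCHED THEN UPDATE SET {assignments}\n"
        f"WHEN NOT MATCHED AND s.{op_column} <> 'd' THEN "
        f"INSERT ({insert_columns}) VALUES ({insert_values})"
    )


props_sql = merge_on_read_properties_sql("db.users")
merge_sql = merge_into_sql("db.users", "latest_user_changes", "id", ["id", "name"])
```

In a Spark session with an Iceberg catalog configured, each string could be passed to `spark.sql(...)`; the `changes_view` is assumed to already hold one row per key with the latest change.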
In practice
- Implement CDC for real-time data
- Use Iceberg's Merge on Read
- Partition tables for parallel upserts
Topics
- Change Data Capture
- Data Ingestion Frameworks
- Apache Iceberg
- Real-time Data Processing
- Database Latency Reduction
Best for: AI Architect, CTO, VP of Engineering/Data, Data Engineer, MLOps Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.