Uber Launches IngestionNext: Streaming-First Data Lake Ingestion Cuts Latency from Hours to Minutes and Compute by About 25%
Summary
Uber engineers re-architected the company's data lake ingestion platform, moving from a batch-oriented system to a streaming-first architecture named IngestionNext. The new platform processes event streams continuously, reducing data ingestion latency from hours to minutes. IngestionNext uses Apache Kafka for event streaming and Apache Flink jobs that process the events and write them to Apache Hudi tables with transactional commits. The shift makes data available sooner to analytics dashboards, experimentation platforms, and machine learning models, across thousands of datasets and high global data volumes. The re-architecture also improved resource efficiency, reducing compute usage by approximately 25% compared to the previous batch system.
Key takeaway
For data platform architects evaluating modernization strategies, Uber's shift to a streaming-first data lake ingestion platform shows that data freshness and resource efficiency can improve together: latency fell from hours to minutes while compute usage dropped by approximately 25%. Consider a similar streaming architecture built on technologies like Apache Kafka, Flink, and Hudi to make data available sooner for critical analytics and machine learning applications while also reducing compute costs.
Key insights
Streaming data ingestion reduces latency and improves data freshness for analytics and ML workloads.
Principles
- Data freshness is a key dimension of data quality.
- Continuous processing can optimize resource utilization.
Method
Implement a streaming pipeline using Apache Kafka and Flink, writing to Hudi tables with transactional commits, and managing file compaction for efficiency.
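The commit step in this method can be illustrated with a small sketch. It is not Uber's implementation and it uses no Kafka, Flink, or Hudi APIs: the `Table`, `Commit`, and `ingest_once` names are hypothetical, the "log" is an in-memory dict of partition to records, and the "table" is a plain list of commits. The one idea it shows is that rows and the offsets they cover are published in the same atomic commit, so a job that crashes before committing resumes from the last committed offsets without losing or duplicating rows.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    """One atomic table commit: the rows written plus the offsets they cover."""
    rows: list
    end_offsets: dict  # partition -> next offset to read


@dataclass
class Table:
    """Hypothetical stand-in for a transactional table.

    A commit is either fully appended or not present at all.
    """
    commits: list = field(default_factory=list)

    def last_offsets(self):
        # Recovery reads the offsets stored in the most recent commit.
        return dict(self.commits[-1].end_offsets) if self.commits else {}

    def rows(self):
        return [row for commit in self.commits for row in commit.rows]


def ingest_once(log, table, max_records, fail_before_commit=False):
    """Read up to max_records per partition, starting at the last committed
    offsets, and publish rows and new offsets in a single commit."""
    start = table.last_offsets()
    rows, end = [], dict(start)
    for partition, records in sorted(log.items()):
        begin = start.get(partition, 0)
        batch = records[begin:begin + max_records]
        rows.extend(batch)
        end[partition] = begin + len(batch)
    if fail_before_commit:
        # Simulated crash: nothing was appended, so nothing is visible.
        raise RuntimeError("simulated crash before commit")
    if rows:
        table.commits.append(Commit(rows, end))
    return len(rows)
```

Calling `ingest_once` repeatedly plays the role of a continuously running job. Because the offsets live inside the commit rather than in a separate store, there is no window in which rows are visible but the offsets are stale, or the reverse.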
In practice
- Use Apache Hudi for transactional data lake operations.
- Run compaction to merge the small files that frequent streaming writes produce.
- Checkpoint stream offsets so jobs can recover reliably after a failure.
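To make the small-file point concrete, here is a minimal sketch of how a compaction pass might decide which files to rewrite together. It is an illustration only, not Hudi's own compaction or clustering strategy: `plan_compaction`, `target_size`, and `small_file_limit` are hypothetical names, sizes are in arbitrary units, and the grouping uses a simple first-fit-decreasing heuristic. It assumes `small_file_limit` is no larger than `target_size`.

```python
def plan_compaction(file_sizes, target_size, small_file_limit):
    """Group small files into bins of at most target_size (first-fit decreasing).

    file_sizes: dict mapping file name -> size.
    Returns a list of groups; each group lists the files to rewrite as one file.
    Assumes small_file_limit <= target_size.
    """
    # Only files below the limit are candidates; largest first, name as tiebreak.
    small = sorted(
        ((name, size) for name, size in file_sizes.items() if size < small_file_limit),
        key=lambda item: (-item[1], item[0]),
    )
    bins = []  # each bin is [remaining_capacity, [file names]]
    for name, size in small:
        for current in bins:
            if size <= current[0]:
                current[0] -= size
                current[1].append(name)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([target_size - size, [name]])
    # A group holding a single file gains nothing from being rewritten.
    return [names for _, names in bins if len(names) > 1]
```

A scheduler could run a plan like this periodically and rewrite each group into one larger file, leaving files that are already at or above the limit untouched.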
Topics
- Data Lake Ingestion
- Streaming Data
- Apache Flink
- Apache Hudi
- Data Freshness
Best for: VP of Engineering/Data, AI Engineer, Machine Learning Engineer, Data Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.