How Uber Built Big Data System — From a Few TBs to 350 Petabytes with Sub-Hour Latency

· Source: Data Engineering on Medium · Field: Technology & Digital — Data Science & Analytics, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, long

Summary

Uber's Big Data Platform evolved through three architectural generations to manage over 350 petabytes of data, ingesting 6 trillion rows daily and serving 4 million analytical queries weekly with sub-30-minute latency. Initially, a Vertica data warehouse (Gen 1) provided a global view but failed to scale with mutable JSON data, schema drift, and high costs. Gen 2 shifted to a Hadoop data lake with Apache Parquet, Presto, Spark, and Hive, achieving horizontal scalability but suffering from 24-hour data latency due to snapshot-based ingestion. This led Uber to develop Apache Hudi (Gen 3) in 2017, a Spark library enabling upserts, incremental reads, and sub-hour latency on HDFS, later integrating with Marmaray and Apache Flink for sub-15-minute real-time streaming ingestion. Hudi's Metadata Table and Record Index further scaled to trillion-row tables and multi-data-center reliability.

Key takeaway

For MLOps Engineers or Data Architects designing large-scale data platforms, Uber's journey underscores the necessity of anticipating mutable data challenges and horizontal scalability. You should prioritize solutions like Apache Hudi for data lakes requiring upserts and incremental processing, rather than relying solely on append-only systems. Consider adopting streaming ingestion with tools like Apache Flink to achieve sub-15-minute data freshness for critical workloads, ensuring your architecture can adapt to exponential growth and diverse user needs.

Key insights

Uber's data platform evolution highlights re-architecting for mutable data at petabyte scale, leading to Apache Hudi's creation.

Principles

Method

Uber's Gen 3 architecture uses Apache Kafka for changelogs, Apache Flink for streaming ingestion, and Apache Hudi on HDFS/GCS for mutable data storage, supporting incremental reads and upserts.

In practice

Topics

Best for: CTO, VP of Engineering/Data, Data Engineer, MLOps Engineer, AI Architect

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.