How Uber Built Big Data System — From a Few TBs to 350 Petabytes with Sub-Hour Latency
Summary
Uber's Big Data Platform evolved through three architectural generations to manage over 350 petabytes of data, ingesting 6 trillion rows daily and serving 4 million analytical queries weekly with sub-30-minute latency. Initially, a Vertica data warehouse (Gen 1) provided a global view but failed to scale with mutable JSON data, schema drift, and high costs. Gen 2 shifted to a Hadoop data lake with Apache Parquet, Presto, Spark, and Hive, achieving horizontal scalability but suffering from 24-hour data latency due to snapshot-based ingestion. This led Uber to develop Apache Hudi (Gen 3) in 2017, a Spark library enabling upserts, incremental reads, and sub-hour latency on HDFS, later integrating with Marmaray and Apache Flink for sub-15-minute real-time streaming ingestion. Hudi's Metadata Table and Record Index further scaled to trillion-row tables and multi-data-center reliability.
Key takeaway
For MLOps Engineers or Data Architects designing large-scale data platforms, Uber's journey underscores the necessity of anticipating mutable data challenges and horizontal scalability. You should prioritize solutions like Apache Hudi for data lakes requiring upserts and incremental processing, rather than relying solely on append-only systems. Consider adopting streaming ingestion with tools like Apache Flink to achieve sub-15-minute data freshness for critical workloads, ensuring your architecture can adapt to exponential growth and diverse user needs.
Key insights
Uber's data platform evolution highlights re-architecting for mutable data at petabyte scale, leading to Apache Hudi's creation.
Principles
- Separate raw ingestion from data modeling.
- Standardize data models for mutable data.
- Re-architect before system collapse.
Method
Uber's Gen 3 architecture uses Apache Kafka for changelogs, Apache Flink for streaming ingestion, and Apache Hudi on HDFS/GCS for mutable data storage, supporting incremental reads and upserts.
In practice
- Implement Copy-on-Write for mutable data lakes.
- Use Metadata Tables to optimize file listing.
- Employ Record Index for trillion-row upserts.
Topics
- Big Data Architecture
- Apache Hudi
- Data Lake
- Data Ingestion
- Apache Flink
- HDFS
Best for: CTO, VP of Engineering/Data, Data Engineer, MLOps Engineer, AI Architect
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.