How Uber Built Big Data System — From a Few TBs to 350 Petabytes with Sub-Hour Latency

2026-06-21 · Source: Data Engineering on Medium · Field: Technology & Digital — Data Science & Analytics, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, long

Summary

Uber's Big Data Platform evolved through three architectural generations to manage over 350 petabytes of data, ingesting 6 trillion rows daily and serving 4 million analytical queries weekly with sub-30-minute latency. Initially, a Vertica data warehouse (Gen 1) provided a global view but failed to scale with mutable JSON data, schema drift, and high costs. Gen 2 shifted to a Hadoop data lake with Apache Parquet, Presto, Spark, and Hive, achieving horizontal scalability but suffering from 24-hour data latency due to snapshot-based ingestion. This led Uber to develop Apache Hudi (Gen 3) in 2017, a Spark library enabling upserts, incremental reads, and sub-hour latency on HDFS, later integrating with Marmaray and Apache Flink for sub-15-minute real-time streaming ingestion. Hudi's Metadata Table and Record Index further scaled to trillion-row tables and multi-data-center reliability.

Key takeaway

For MLOps Engineers or Data Architects designing large-scale data platforms, Uber's journey underscores the necessity of anticipating mutable data challenges and horizontal scalability. You should prioritize solutions like Apache Hudi for data lakes requiring upserts and incremental processing, rather than relying solely on append-only systems. Consider adopting streaming ingestion with tools like Apache Flink to achieve sub-15-minute data freshness for critical workloads, ensuring your architecture can adapt to exponential growth and diverse user needs.

Key insights

Uber's data platform evolution highlights re-architecting for mutable data at petabyte scale, leading to Apache Hudi's creation.

Principles

Separate raw ingestion from data modeling.
Standardize data models for mutable data.
Re-architect before system collapse.

Method

Uber's Gen 3 architecture uses Apache Kafka for changelogs, Apache Flink for streaming ingestion, and Apache Hudi on HDFS/GCS for mutable data storage, supporting incremental reads and upserts.

In practice

Implement Copy-on-Write for mutable data lakes.
Use Metadata Tables to optimize file listing.
Employ Record Index for trillion-row upserts.

Topics

Big Data Architecture
Apache Hudi
Data Lake
Data Ingestion
Apache Flink
HDFS

Best for: CTO, VP of Engineering/Data, Data Engineer, MLOps Engineer, AI Architect

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.