What is a Data Lakehouse?
Summary
A data lakehouse is a data architecture that combines the reliability of a data warehouse with the scale of a data lake, with the aim of replacing the two separate systems. Running a warehouse for curated analytics alongside a lake for raw, unstructured data means maintaining separate ingestion paths, quality checks, and access models; the lakehouse is meant to remove that duplication. It is built on a single object storage layer and uses open table formats such as Apache Iceberg, Delta Lake, or Apache Hudi to provide database-like ACID transactions and consistent views. A shared catalog lets different tools, such as Apache Spark for ingestion and Trino for querying, see the same latest version of the data. A governance layer, for example AWS Lake Formation or Databricks Unity Catalog, manages access rules and data lineage. The result is unified data access for batch, streaming, and machine learning workloads, but it requires platform engineering effort for tasks like file optimization and careful schema management. The trade-off is flexibility and scale against operational complexity.
Key takeaway
For AI Architects evaluating data infrastructure, a data lakehouse offers a scalable foundation for diverse workloads such as streaming analytics and machine learning, combining data warehouse reliability with data lake economics. However, you must budget dedicated platform engineering time for file optimization and schema evolution. Match the architecture to your team's size and specific workload requirements to avoid unnecessary operational overhead.
Key insights
A data lakehouse unifies data warehouse reliability and data lake scale on a single storage layer using open table formats.
Principles
- Unify data layers to reduce duplication.
- Open table formats ensure data consistency.
- Governance layers secure shared data access.
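To make the consistency principle concrete, here is a minimal sketch of how a table format could commit changes atomically: each write produces a new immutable snapshot, and a commit succeeds only if no other writer has moved the table forward in the meantime. This is a toy, in-memory model. The names `ToyTable`, `Snapshot`, and `CommitConflict` are hypothetical and are not the API of Apache Iceberg, Delta Lake, or Apache Hudi.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One immutable table version: a version number plus its data files."""
    version: int
    files: tuple


class CommitConflict(Exception):
    """Raised when another writer committed after our base snapshot."""


class ToyTable:
    """Hypothetical stand-in for a table's metadata pointer (assumption)."""

    def __init__(self):
        self._current = Snapshot(version=0, files=())
        self._lock = threading.Lock()

    def read(self) -> Snapshot:
        # Readers get an immutable snapshot, so their view never changes
        # underneath them even while writers keep committing.
        return self._current

    def commit(self, base: Snapshot, new_files: list) -> Snapshot:
        # Compare-and-swap: only advance the pointer if the table is still
        # at the version the writer started from.
        with self._lock:
            if self._current.version != base.version:
                raise CommitConflict(
                    f"table is at v{self._current.version}, "
                    f"writer started from v{base.version}"
                )
            new = Snapshot(base.version + 1, base.files + tuple(new_files))
            self._current = new
            return new
```

A writer that loses the race catches `CommitConflict`, re-reads the table, and retries on top of the newer snapshot. Real table formats add much more (manifests, schema tracking, conflict resolution rules), so treat this only as an illustration of the idea.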
Method
Build a data lakehouse with a single object storage layer, an open table format (e.g., Apache Iceberg), a shared catalog for metadata, and a governance layer (e.g., AWS Lake Formation) for access control.
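The way these layers fit together can be sketched in a few lines: a query engine asks the catalog where a table lives and which snapshot is current, and the governance layer decides which columns the caller may read. Everything below is an illustrative assumption. `ToyCatalog`, `ToyGovernance`, `plan_scan`, and the storage path are hypothetical and do not reflect the AWS Lake Formation or Unity Catalog APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """Catalog metadata for one table."""
    location: str          # object storage prefix (hypothetical path)
    table_format: str      # e.g. "iceberg"
    current_snapshot: int  # version every engine should read
    columns: tuple


class ToyCatalog:
    """Shared catalog: maps table names to their metadata."""

    def __init__(self):
        self._tables = {}

    def register(self, name: str, entry: TableEntry) -> None:
        self._tables[name] = entry

    def load(self, name: str) -> TableEntry:
        return self._tables[name]


class ToyGovernance:
    """Governance layer: column-level grants per (principal, table)."""

    def __init__(self):
        self._grants = {}

    def grant(self, principal: str, table: str, columns: list) -> None:
        self._grants.setdefault((principal, table), set()).update(columns)

    def check(self, principal: str, table: str, requested: list) -> None:
        granted = self._grants.get((principal, table), set())
        denied = [c for c in requested if c not in granted]
        if denied:
            raise PermissionError(f"{principal} may not read {denied} on {table}")


def plan_scan(catalog, governance, principal, table, columns):
    """Resolve a table through the catalog, then enforce access rules."""
    entry = catalog.load(table)
    unknown = [c for c in columns if c not in entry.columns]
    if unknown:
        raise KeyError(f"unknown columns: {unknown}")
    governance.check(principal, table, columns)
    return {
        "location": entry.location,
        "snapshot": entry.current_snapshot,
        "columns": list(columns),
    }
```

Because every engine resolves tables through the same catalog entry, an ingestion job and a query engine would agree on the current snapshot. In a real deployment the catalog and governance services are external systems, not in-process objects.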
In practice
- Use Parquet for optimized file storage.
- Schedule background jobs to merge tiny files.
- Test core data types across query engines.
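As an illustration of the small-file merge job mentioned above, the following sketch plans which files a background compaction run might combine. It uses a simple first-fit-decreasing packing. The function name `plan_compaction`, the thresholds, and the file names are assumptions for the example, and real table formats ship their own compaction procedures that should be preferred.

```python
def plan_compaction(file_sizes_mb, small_threshold_mb, target_size_mb):
    """Group small files into merge batches no larger than the target size.

    file_sizes_mb: dict mapping file name -> size in MB.
    Returns a list of groups (lists of file names) worth rewriting.
    """
    # Only files below the threshold are candidates; large files stay as-is.
    small = sorted(
        ((size, name) for name, size in file_sizes_mb.items()
         if size < small_threshold_mb),
        reverse=True,
    )

    groups = []  # each entry: [total_size_mb, [file names]]
    for size, name in small:
        # First-fit: place the file in the first group with enough room.
        for group in groups:
            if group[0] + size <= target_size_mb:
                group[0] += size
                group[1].append(name)
                break
        else:
            groups.append([size, [name]])

    # A group with a single file gains nothing from being rewritten.
    return [names for _total, names in groups if len(names) > 1]
```

A scheduler would run this plan periodically and rewrite each group into one larger file, committing the result as a new table snapshot so readers are never exposed to a half-finished merge.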
Topics
- Data Lakehouse
- Data Warehouse
- Data Lake
- Apache Iceberg
- Data Governance
- Object Storage
- Data Architecture
Best for: Data Engineer, AI Architect, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo.