What is data pipeline architecture?
Summary
Data pipeline architecture defines the end-to-end design for collecting, processing, storing, and delivering data from source systems to users, applications, and models. This blueprint covers data flow, transformation points, and tool selection, and it operates at two design levels: logical ("the what") and physical ("the how"). All pipelines share four core layers: Ingestion (batch or streaming, often with change data capture), Processing and Transformation (cleaning, reshaping, and enriching, similar to ETL), Storage (a data lake, warehouse, or lakehouse, using open formats like Delta Lake or Apache Iceberg), and Serving and Consumption (delivering data to BI tools, ML platforms, or operational systems). Stage models vary (3, 4, or 5 stages), but the underlying work is consistent. Common architectural patterns include Batch, Streaming, Lambda, Kappa, and Medallion (a lakehouse pattern with Bronze, Silver, and Gold tiers). The article also contrasts ETL (transform before load) with ELT (load before transform), highlighting ELT's prevalence on modern cloud platforms like Databricks, where scalable compute makes transforming after loading practical.
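The four core layers can be illustrated with a minimal, in-memory sketch. Everything here is an assumption for illustration: the function names (`ingest`, `transform`, `store`, `serve`), the order records, and the use of a plain dict as a stand-in for a lake or warehouse table are hypothetical and do not come from the original article.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    customer: str
    amount: float


def ingest():
    """Ingestion: pull raw records from a source (a hard-coded batch stands in here)."""
    return [
        {"order_id": "1", "customer": " alice ", "amount": "10.50"},
        {"order_id": "2", "customer": "BOB", "amount": "4.00"},
        {"order_id": "3", "customer": "alice", "amount": "not-a-number"},
    ]


def transform(raw_rows):
    """Processing: clean and reshape; rows that fail to parse are dropped."""
    orders = []
    for row in raw_rows:
        try:
            orders.append(
                Order(
                    order_id=int(row["order_id"]),
                    customer=row["customer"].strip().lower(),
                    amount=float(row["amount"]),
                )
            )
        except ValueError:
            continue  # a real pipeline would quarantine the row rather than discard it
    return orders


def store(orders, table):
    """Storage: a dict keyed by order_id stands in for a lake or warehouse table."""
    for order in orders:
        table[order.order_id] = order


def serve(table):
    """Serving: expose a consumer-facing view, here total spend per customer."""
    totals = {}
    for order in table.values():
        totals[order.customer] = totals.get(order.customer, 0.0) + order.amount
    return totals


def run_pipeline():
    table = {}
    store(transform(ingest()), table)
    return serve(table)


if __name__ == "__main__":
    print(run_pipeline())
```

Keeping each layer behind its own function mirrors the logical design level: the physical choice (for example, swapping the dict for a real table format) can change without touching the other layers.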
Key takeaway
For data engineers designing or optimizing data pipelines, understanding the architectural patterns and core layers is crucial. Select a pattern (Batch, Streaming, Lambda, Kappa, or Medallion) based on latency and volume requirements rather than defaulting to a generic solution. Prefer ELT on modern cloud platforms for flexibility and scalability, and use features such as Delta Lake's ACID transactions for data reliability. Unifying batch and streaming on a platform such as Databricks Lakeflow can also reduce operational burden.
Key insights
Data pipeline architecture is the blueprint for data flow, defined by layers, stages, and patterns tailored to use cases.
Principles
- Architecture must match the specific use case.
- All pipelines share four core layers.
- Medallion architecture organizes data into quality tiers.
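The quality tiers of the Medallion pattern can be sketched in plain Python. This is a simplified illustration, not the article's implementation: the function names, the event fields (`event_id`, `region`, `amount`), and the rule that a later duplicate replaces an earlier one are all assumptions, and in-memory lists stand in for lakehouse tables.

```python
from collections import defaultdict


def to_bronze(events):
    """Bronze: keep every record as received, adding only its arrival order."""
    return [{"seq": i, **event} for i, event in enumerate(events)]


def to_silver(bronze):
    """Silver: drop incomplete records, normalize types, and deduplicate by event_id.

    Assumption: when an event_id repeats, the later record wins.
    """
    latest = {}
    for rec in bronze:
        if rec.get("event_id") is None or rec.get("amount") is None:
            continue
        latest[rec["event_id"]] = {
            "event_id": rec["event_id"],
            "region": str(rec.get("region", "unknown")).upper(),
            "amount": float(rec["amount"]),
        }
    return list(latest.values())


def to_gold(silver):
    """Gold: aggregate cleaned records into a business-level view (total per region)."""
    totals = defaultdict(float)
    for rec in silver:
        totals[rec["region"]] += rec["amount"]
    return dict(totals)


RAW_EVENTS = [
    {"event_id": "a1", "region": "eu", "amount": "5"},
    {"event_id": "a1", "region": "eu", "amount": "7"},   # later correction of a1
    {"event_id": None, "region": "us", "amount": "3"},   # missing key, dropped in Silver
    {"event_id": "b2", "region": "us", "amount": 2},
]

if __name__ == "__main__":
    bronze = to_bronze(RAW_EVENTS)
    silver = to_silver(bronze)
    print(to_gold(silver))
```

Because Bronze keeps every raw record, the Silver and Gold tiers can be rebuilt from it if the cleaning rules change.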
In practice
- Use Delta Lake for ACID transactions and time travel.
- Consider ELT for modern cloud lakehouses.
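The load-before-transform ordering of ELT can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a local stand-in for a cloud warehouse or lakehouse engine; the table names (`raw_orders`, `customer_totals`), the columns, and the `run_elt` helper are hypothetical, and this sketch does not use Delta Lake or demonstrate its ACID or time-travel features.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows unchanged, then transform them inside the store with SQL."""
    conn = sqlite3.connect(":memory:")

    # Load: land the data as-is, with every column kept as text.
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: clean and aggregate using the engine's own compute.
    # Rows whose amount does not start with a digit are skipped.
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT lower(trim(customer)) AS customer,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        WHERE amount GLOB '[0-9]*'
        GROUP BY lower(trim(customer))
        """
    )

    raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    totals = conn.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return raw_count, totals


SAMPLE_ROWS = [
    ("1", " Alice ", "10.50"),
    ("2", "bob", "4.00"),
    ("3", "ALICE", "2.50"),
    ("4", "bob", "n/a"),
]

if __name__ == "__main__":
    print(run_elt(SAMPLE_ROWS))
```

The raw table still holds all four rows after the transform runs, which is the practical difference from ETL: the cleaning rule can be revised and re-run later without re-ingesting from the source.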
Topics
- Data Pipeline Architecture
- Data Ingestion
- Medallion Architecture
- ETL vs ELT
- Lakehouse
- Delta Lake
Best for: Data Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Databricks.