PIPELINE

Source: Data Engineering on Medium · Field: Technology & Digital (Data Science & Analytics, Cloud Computing & IT Infrastructure) · Depth: Intermediate, quick

Summary

Data engineering has shifted from traditional ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform), driven by the economics of cloud computing. Historically, data was transformed before loading because storage was expensive and compute was limited. Cloud services have made storage nearly free and compute elastic, removing both constraints. Even so, many practitioners kept using ETL out of habit, which led to inefficiencies. The article illustrates this with a pipeline for mobile application event data (user clicks, page views, and feature interactions). Its work included JSON parsing and hourly aggregation, and it ran for eight hours instead of the intended forty-five minutes.

Key takeaway

For data engineers building or optimizing cloud-based pipelines, the shift from ETL to ELT is critical. Traditional ETL processes, which transform data before loading, are likely inefficient and costly now that cloud storage is cheap and compute is elastic. Re-evaluate existing long-running pipelines: load raw data directly into the cloud data warehouse first, then run transformations there, to improve performance and reduce operational overhead.
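The load-first, transform-later pattern can be sketched in a few lines. This is only an illustration, not the article's pipeline: it uses Python's built-in sqlite3 module as a stand-in for a cloud data warehouse and assumes the SQLite build includes the JSON functions (json_extract). The event fields (user_id, event_type, ts) and the table names raw_events and hourly_event_counts are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events, roughly as a mobile app might emit them.
RAW_EVENTS = [
    {"user_id": "u1", "event_type": "click", "ts": "2024-01-01T10:05:00"},
    {"user_id": "u2", "event_type": "page_view", "ts": "2024-01-01T10:40:00"},
    {"user_id": "u1", "event_type": "click", "ts": "2024-01-01T11:15:00"},
]


def load_raw(conn, events):
    """Load step: store each event as an untouched JSON string."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(json.dumps(event),) for event in events],
    )


def transform_hourly(conn):
    """Transform step: parse JSON and aggregate per hour inside the database."""
    conn.execute("DROP TABLE IF EXISTS hourly_event_counts")
    conn.execute(
        """
        CREATE TABLE hourly_event_counts AS
        SELECT
            strftime('%Y-%m-%dT%H:00:00', json_extract(payload, '$.ts')) AS hour,
            json_extract(payload, '$.event_type') AS event_type,
            COUNT(*) AS events
        FROM raw_events
        GROUP BY hour, event_type
        """
    )
    return conn.execute(
        "SELECT hour, event_type, events FROM hourly_event_counts "
        "ORDER BY hour, event_type"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_EVENTS)
    for row in transform_hourly(conn):
        print(row)
```

Because the raw payloads stay in raw_events, the aggregation can be changed and rerun without extracting the data again. In a real warehouse the same idea would be expressed in that system's own SQL dialect.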

Key insights

Cloud economics make ELT superior to traditional ETL for modern data pipelines.

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Data Engineer, Analytics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.