4 YAML Files Instead of PySpark: How We Let Analysts Build Data Pipelines Without Engineers

Source: Towards Data Science · Field: Technology & Digital (Data Science & Analytics, Software Development & Engineering) · Depth: Intermediate, long

Summary

The data engineering team at Mindbox moved from a PySpark-dependent pipeline architecture to a declarative platform based on YAML and SQL, aiming to speed up metric recalculations and let analysts build pipelines themselves. Previously, building a single pipeline with PySpark and Airflow took one to three weeks and required developers proficient in Python. The new internal platform uses dlt (data load tool) for ingestion, dbt (data build tool) on Trino for SQL transformations, and Airflow with Cosmos for orchestration. Analysts and product managers can now create a regularly updated pipeline in a single day by writing four YAML files plus SQL models. Pipeline ownership has moved from developers to the analysts who know the business context.

Key takeaway

If you are a data analyst or product manager whose pipeline delivery is slowed by dependence on developers, a declarative, SQL-first platform such as dbt on Trino can shorten it considerably. Focus on SQL, YAML, and Git basics, and rely on tools that abstract away the Python layer so you can build and deploy data marts independently. In the case described here, delivery time fell from weeks to a single day.

Key insights

Shifting to a declarative, SQL-centric data platform lets analysts own their pipelines and, in this case, cut delivery time from one to three weeks to a single day.

Principles

Method

Implement dlt for YAML-configured data ingestion, dbt on Trino for SQL-based data transformations, and Airflow with Cosmos for auto-generated DAG orchestration from YAML and dbt projects.
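
To make the idea of deriving orchestration from a declarative spec more concrete, here is a minimal sketch in plain Python. It is not Mindbox's implementation and does not call dlt, dbt, Airflow, or Cosmos. The spec layout, the key names (`sources`, `models`), and the `ingest.` / `transform.` task prefixes are all assumptions made for illustration. The spec is written as the dict a YAML parser would return, and the standard-library `graphlib` module stands in for the scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline spec, written as the dict a YAML parser would return.
# Every key and name below is an illustrative assumption, not the platform's schema.
PIPELINE = {
    "name": "orders_mart",
    "schedule": "@daily",
    "sources": ["orders", "customers"],  # one ingestion task per source
    "models": {                          # SQL model -> upstream dependencies
        "stg_orders": ["orders"],
        "stg_customers": ["customers"],
        "orders_mart": ["stg_orders", "stg_customers"],
    },
}


def build_task_order(spec):
    """Return task ids in an order that respects every declared dependency."""
    graph = {}
    # Ingestion tasks have no upstream dependencies.
    for source in spec["sources"]:
        graph[f"ingest.{source}"] = set()
    # Each model depends on the ingestion or transformation task for its inputs.
    for model, upstream in spec["models"].items():
        deps = set()
        for name in upstream:
            prefix = "ingest" if name in spec["sources"] else "transform"
            deps.add(f"{prefix}.{name}")
        graph[f"transform.{model}"] = deps
    # TopologicalSorter takes a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


order = build_task_order(PIPELINE)
for task_id in order:
    print(task_id)
```

The point of the sketch is that the analyst only declares sources and model dependencies, and the execution order falls out of the declaration. In the platform described, that role is filled by Airflow with Cosmos reading the YAML and the dbt project.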

In practice

Topics

Best for: Data Analyst, Data Engineer, Analytics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.