4 YAML Files Instead of PySpark: How We Let Analysts Build Data Pipelines Without Engineers
Summary
The data engineering team at Mindbox moved from a PySpark-based pipeline architecture to a declarative platform built on YAML and SQL, aiming to speed up metric recalculations and let analysts build pipelines themselves. Previously, a single pipeline built with PySpark and Airflow took one to three weeks and required developers proficient in Python. The new internal platform uses dlt (data load tool) for ingestion, dbt (data build tool) on Trino for SQL transformations, and Airflow with Cosmos for orchestration. With it, analysts and product managers can create a regularly updated pipeline in a single day by writing four YAML files plus SQL models, and pipeline ownership has moved from developers to the analysts who know the business context.
Key takeaway
For data analysts and product managers whose pipelines are slowed by dependence on developers, a declarative, SQL-first platform such as dbt on Trino can cut delivery time from weeks to a single day. Focus on SQL, YAML, and Git basics, and rely on tools that hide the Python layer so you can build and deploy data marts independently.
Key insights
Shifting to a declarative, SQL-centric data platform lets analysts own their pipelines and cuts delivery time from weeks to a single day.
Principles
- Prioritize declarative configurations over imperative code.
- Empower domain experts with self-service data tools.
- Optimize tools for specific data layer functions.
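To make the first principle concrete, here is a minimal sketch of a declarative spec that a generic planner turns into steps. The spec keys (`name`, `schedule`, `source`, `models`), the `Step` type, and `plan_pipeline` are all hypothetical illustrations and are not taken from the original article or from any of the tools it names. A plain dict stands in for a parsed YAML file so the example needs only the standard library.

```python
from dataclasses import dataclass

# Hypothetical spec: in a real setup this would be loaded from a YAML file.
PIPELINE_SPEC = {
    "name": "orders_daily",
    "schedule": "@daily",
    "source": {"type": "rest_api", "resource": "orders"},
    "models": ["stg_orders", "orders_mart"],
}

REQUIRED_KEYS = ("name", "schedule", "source", "models")


@dataclass(frozen=True)
class Step:
    kind: str    # "ingest" or "transform"
    target: str  # resource or model name


def plan_pipeline(spec: dict) -> list[Step]:
    """Validate a declarative spec and expand it into ordered steps."""
    missing = [key for key in REQUIRED_KEYS if key not in spec]
    if missing:
        raise ValueError(f"spec is missing keys: {missing}")
    # Ingestion always comes first, followed by the listed models in order.
    steps = [Step("ingest", spec["source"]["resource"])]
    steps += [Step("transform", model) for model in spec["models"]]
    return steps
```

The author of the spec states what should exist and never writes control flow; the planner owns the how. That separation is what would let someone who knows YAML but not Python define a pipeline.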
Method
Implement dlt for YAML-configured data ingestion, dbt on Trino for SQL-based data transformations, and Airflow with Cosmos for auto-generated DAG orchestration from YAML and dbt projects.
In practice
- Use dlt for API/DB ingestion via YAML.
- Transform data with dbt and Trino using pure SQL.
- Automate Airflow DAGs from dbt projects via Cosmos.
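The last step above depends on the fact that dbt models declare their upstream dependencies through `ref()` calls, which is what allows a task graph to be derived from the project rather than written by hand. The sketch below illustrates that idea using only the Python standard library. It is not how dbt or Cosmos are implemented, and the model names, SQL, and helper functions are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Matches Jinja-style {{ ref('model_name') }} calls inside a SQL model.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical models: name -> SQL text, as they might appear in a dbt project.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders_mart": """
        select o.order_id, c.segment
        from {{ ref('stg_orders') }} o
        join {{ ref('stg_customers') }} c on o.customer_id = c.customer_id
    """,
}


def build_dependencies(models: dict) -> dict:
    """Map each model to the set of models it references via ref()."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def execution_order(models: dict) -> list:
    """Return a run order in which every model follows its upstream models."""
    return list(TopologicalSorter(build_dependencies(models)).static_order())
```

Calling `execution_order(MODELS)` places both staging models before `orders_mart`. An orchestrator can map each entry to one task, so adding a model only requires adding a SQL file.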
Topics
- Data Pipelines
- PySpark
- dbt
- Trino
- Airflow
Best for: Data Analyst, Data Engineer, Analytics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.