I Thought Data Engineering Was Just Writing Scripts. I Was Wrong.

· Source: Towards Data Science · Field: Technology & Digital — Data Science & Analytics, Software Development & Engineering · Depth: Novice, medium

Summary

The article follows a data analyst learning that data engineering involves more than scripting, traced through the evolution of an ETL pipeline built on the GitHub API. The first version extracted 30 Python repositories created after 2025-04-22, cleaned their descriptions, and saved the result to a CSV file. Making the pipeline "production-ready" exposed three problems: the pipeline kept no record of earlier runs, so each rerun duplicated data; the output did not persist between sessions, so data disappeared; and every run had to be started by hand. The author addressed the first two by switching from CSV to a SQLite database, making the load idempotent (existing records are deleted and re-inserted, matched on URL), and storing the SQLite database on Google Drive so that it survives across sessions. The remaining challenge is automated scheduling with a tool such as Apache Airflow or Prefect. The broader point is that data engineering is about building reliable systems, not one-off scripts.
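As a rough illustration of the extract and clean steps in that first version, here is a minimal sketch against GitHub's repository search endpoint. The function names, the choice of fields, and the cleaning rule (strip whitespace, substitute a placeholder for a missing description) are assumptions for illustration; the original article's code may differ.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(language="python", created_after="2025-04-22", per_page=30):
    """Build a search URL for repositories in one language created after a date."""
    query = f"language:{language} created:>{created_after}"
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return f"{SEARCH_URL}?{params}"


def fetch_repos(url):
    """Call the search endpoint and return the list under the 'items' key."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["items"]


def clean_repo(item):
    """Keep a few fields and normalize the description (assumed cleaning rule)."""
    description = (item.get("description") or "").strip()
    return {
        "url": item["html_url"],
        "name": item["name"],
        # Placeholder text is a hypothetical choice, not taken from the article.
        "description": description or "No description",
    }


if __name__ == "__main__":
    # Performs a real, unauthenticated network request when run directly.
    rows = [clean_repo(item) for item in fetch_repos(build_search_url())]
    print(len(rows))
```

Unauthenticated search requests are rate-limited by GitHub, so a script like this is only suitable for occasional manual runs.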

Key takeaway

For data analysts transitioning to data engineering, your initial ETL scripts, while functional, will likely fail in production environments. You must prioritize building reliable systems by addressing idempotency to prevent data duplication, ensuring data persistence beyond single sessions, and implementing automated scheduling. Neglecting these aspects turns a pipeline into a liability, risking data integrity and operational continuity. Start integrating these engineering principles early in your development process.

Key insights

True data engineering extends beyond scripting to build reliable systems through idempotency, persistence, and scheduling.

Method

An ETL pipeline was upgraded by replacing CSV output with SQLite, implementing a delete-then-insert strategy for idempotency, and persisting the database to Google Drive for session-independent data storage.
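The delete-then-insert strategy can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch, not the author's code: the table name `repos`, its columns, and the helper names are hypothetical, and only the idea of matching records on URL comes from the summary.

```python
import sqlite3


def load_repos(conn, repos):
    """Idempotent load: remove any existing row with the same URL, then insert."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            """CREATE TABLE IF NOT EXISTS repos (
                   url TEXT PRIMARY KEY,
                   name TEXT,
                   description TEXT
               )"""
        )
        for repo in repos:
            # Deleting first means a rerun replaces the row instead of duplicating it.
            conn.execute("DELETE FROM repos WHERE url = ?", (repo["url"],))
            conn.execute(
                "INSERT INTO repos (url, name, description) VALUES (?, ?, ?)",
                (repo["url"], repo["name"], repo["description"]),
            )


def count_rows(conn):
    """Return the number of rows currently stored in the repos table."""
    return conn.execute("SELECT COUNT(*) FROM repos").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would be used for real storage
    batch = [{"url": "https://example.com/a", "name": "a", "description": "demo"}]
    load_repos(conn, batch)
    load_repos(conn, batch)  # second run with the same data
    print(count_rows(conn))  # 1
```

Running the load twice with the same batch leaves one row per URL, which is the property that makes reruns safe. Passing a file path on persistent storage to `sqlite3.connect` instead of `":memory:"` is one way to keep the data between sessions; the exact Google Drive setup used in the article is not shown here.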

Best for: Data Analyst, Data Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.