I Deleted 4TB of Production Data by Trusting dbt’s --full-refresh Flag

Source: Data Engineering on Medium · Field: Technology & Digital (Data Science & Analytics, Software Development & Engineering) · Depth: Intermediate, quick

Summary

An analytics engineer accidentally deleted 4TB of production data, 18 months of historical analytics, by running `dbt run --full-refresh --select fct_orders`. The command was meant to rebuild a single table. Because `--full-refresh` makes dbt discard an incremental model's existing table and rebuild it from the model's query, it replaced 847 million rows with only the 124,000 rows from that day's incremental load. The scenario, based on real dbt data loss events from 2022 to 2024, turns on a misunderstanding of what the `--full-refresh` flag actually does. To make matters worse, backups had failed three weeks earlier and had not been validated, so the data could not be recovered. The event shows why destructive commands must be understood before they are run and why backup strategies must be validated.
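The row counts in the summary suggest one simple safeguard: compare a table's size before and after a rebuild and refuse to accept a result that has shrunk dramatically. The following is a minimal sketch, not something taken from the original article; the function names and the 90% threshold are assumptions chosen for illustration, and fetching the actual counts from the warehouse is left out.

```python
def retained_fraction(rows_before: int, rows_after: int) -> float:
    """Fraction of the original row count that survives a rebuild."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    return rows_after / rows_before


def rebuild_looks_safe(rows_before: int, rows_after: int,
                       min_retained: float = 0.9) -> bool:
    """Return True only if the rebuilt table keeps at least min_retained
    of its previous rows. The 0.9 default is an arbitrary assumption."""
    return retained_fraction(rows_before, rows_after) >= min_retained


if __name__ == "__main__":
    # Row counts reported in the incident summary.
    print(rebuild_looks_safe(847_000_000, 124_000))  # False
```

With the incident's numbers, roughly 0.015% of the rows survive, so any reasonable threshold would have flagged the rebuild before it replaced the production table.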

Key takeaway

If you manage critical data pipelines, make sure you understand exactly what dbt's `--full-refresh` flag and similar destructive commands do before you run them. Always test such operations in a development environment first. Just as important, implement a backup strategy and validate it regularly: relying on unverified backups can turn a data loss incident into a disaster.
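One way to enforce the "development first" habit is a small pre-flight check that inspects the dbt arguments before the command runs. This is a hedged sketch rather than the article's method: `PROD_TARGETS`, `target_of`, and `is_blocked` are hypothetical names, the target names must match your own `profiles.yml`, and the fallback target of `dev` is an assumption that only holds if your profile's default target is not production.

```python
from typing import Sequence

# Assumption: these are the names of production targets in profiles.yml.
PROD_TARGETS = {"prod", "production"}


def target_of(args: Sequence[str], default: str = "dev") -> str:
    """Extract the value passed via --target / -t, else return the default."""
    for i, arg in enumerate(args):
        if arg in ("--target", "-t") and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--target="):
            return arg.split("=", 1)[1]
    return default


def is_blocked(args: Sequence[str], confirmed: bool = False) -> bool:
    """Block a full refresh against production unless explicitly confirmed."""
    full_refresh = "--full-refresh" in args or "-f" in args
    return full_refresh and target_of(args) in PROD_TARGETS and not confirmed


if __name__ == "__main__":
    cmd = ["run", "--full-refresh", "--select", "fct_orders", "--target", "prod"]
    print(is_blocked(cmd))                  # True
    print(is_blocked(cmd, confirmed=True))  # False
```

A wrapper script could call `is_blocked` on its arguments and exit with an error before invoking dbt. The check is deliberately conservative about what it recognizes, so treat it as a starting point rather than a complete safeguard.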

Key insights

Misunderstanding a command's full implications can lead to catastrophic data loss, especially without validated backups.

Best for: Analytics Engineer, Data Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.