Implementation of SCD2 On Truncate-Load Table With No Unique Column
Summary
This article details a method for implementing Slowly Changing Dimension Type 2 (SCD2) in a Databricks Silver Delta table when the source Bronze Delta table has no unique identifier or audit columns and is reloaded daily by truncate-load. It shows that Delta Lake's Change Data Feed (CDF) cannot track inserts, updates, and deletes under a truncate-load strategy, because it registers every record as an "insert" event after each truncation. The proposed solution creates three temporary views: one for records to update (marking the old versions as inactive), one for records to insert (brand-new rows and the new versions of updated rows), and one for records to mark as deleted (rows no longer present in the source). The views are combined into a final dataset, which a MERGE INTO operation uses to upsert data into the Silver Delta table, preserving history through the `eff_start_tms`, `eff_end_tms`, and `active_flag` columns.
Key takeaway
For Data Engineers managing Databricks Delta tables with truncate-load Bronze sources, relying solely on Change Data Feed for SCD2 is insufficient. You should implement a custom merge strategy using temporary views to explicitly identify and manage inserts, updates, and deletions, ensuring accurate historical data tracking in your Silver layer. This approach is critical for maintaining data integrity and auditability.
Key insights
SCD2 implementation on truncate-load Delta tables requires custom logic beyond standard Change Data Feed.
Principles
- CDF is unreliable with truncate-load: after each reload it reports every row as an insert, so updates and deletes cannot be told apart.
- SCD2 needs explicit active/inactive flags.
Method
Create three temporary views: rows to update (old versions to mark inactive), rows to insert (brand-new rows plus the new versions of updated rows), and rows to mark as deleted. Combine them into a final dataset, then apply it to the Silver Delta table with a MERGE INTO statement.
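The flow above can be sketched in plain Python to make the three change sets concrete. This is an illustration under stated assumptions, not the article's Spark SQL: rows are dictionaries, the hypothetical `key_cols` stand in for whatever column combination identifies a record (assumed to be unique within one load), the `'Y'`/`'N'` flag values and the `HIGH_DATE` open-ended timestamp are illustrative choices, and `plan_changes` and `apply_changes` are made-up helper names.

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # assumed "open-ended" end timestamp


def key_of(row, key_cols):
    """Composite key built from the assumed identifying columns."""
    return tuple(row[c] for c in key_cols)


def plan_changes(bronze, silver, key_cols, attr_cols):
    """Return (updates, inserts, deletes), mirroring the three views."""
    active = {key_of(r, key_cols): r for r in silver if r["active_flag"] == "Y"}
    source = {key_of(r, key_cols): r for r in bronze}

    # View 1: keys of active Silver rows whose attributes changed in the source.
    updates = {k for k, s in source.items()
               if k in active and any(s[c] != active[k][c] for c in attr_cols)}
    # View 2: brand-new keys plus the new versions of updated rows.
    inserts = [s for k, s in source.items() if k not in active or k in updates]
    # View 3: active Silver keys missing from the source (an anti join).
    deletes = [k for k in active if k not in source]
    return updates, inserts, deletes


def apply_changes(silver, updates, inserts, deletes, key_cols, now):
    """Emulate the MERGE: close old versions, then append new ones."""
    to_close = set(updates) | set(deletes)
    for row in silver:
        if row["active_flag"] == "Y" and key_of(row, key_cols) in to_close:
            row["eff_end_tms"] = now
            row["active_flag"] = "N"
    for src in inserts:
        silver.append({**src, "eff_start_tms": now,
                       "eff_end_tms": HIGH_DATE, "active_flag": "Y"})
    return silver


# Yesterday's Silver state: two active rows.
start = datetime(2024, 1, 1)
silver = [
    {"name": "Ann", "city": "Oslo", "eff_start_tms": start,
     "eff_end_tms": HIGH_DATE, "active_flag": "Y"},
    {"name": "Bob", "city": "Rome", "eff_start_tms": start,
     "eff_end_tms": HIGH_DATE, "active_flag": "Y"},
]
# Today's truncate-loaded Bronze: Ann changed, Cid is new, Bob is gone.
bronze = [{"name": "Ann", "city": "Paris"}, {"name": "Cid", "city": "Lima"}]

now = datetime(2024, 1, 2)
updates, inserts, deletes = plan_changes(bronze, silver, ["name"], ["city"])
silver = apply_changes(silver, updates, inserts, deletes, ["name"], now)
```

After the run, Silver holds four rows: the closed versions of Ann (Oslo) and Bob, and the active versions of Ann (Paris) and Cid. Closing rows before appending keeps at most one active version per key, which is the invariant the MERGE INTO step is meant to preserve.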
In practice
- Use `eff_start_tms`, `eff_end_tms`, `active_flag` for history.
- Detect updates by comparing source (Bronze) rows against the active target (Silver) rows.
- Identify deletions via `LEFT ANTI JOIN`.
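To illustrate the anti-join semantics used for deletions, here is a minimal plain-Python sketch. It assumes rows are dictionaries and, because the table has no unique column, joins on every business column; `left_anti_join` is a hypothetical helper, not a Spark or Delta API.

```python
def left_anti_join(left, right, on):
    """Rows of `left` that have no match in `right` on the `on` columns."""
    right_keys = {tuple(r[c] for c in on) for r in right}
    return [row for row in left if tuple(row[c] for c in on) not in right_keys]


# Active Silver rows versus today's truncate-loaded Bronze rows.
silver_active = [
    {"name": "Ann", "city": "Oslo"},
    {"name": "Bob", "city": "Rome"},
]
bronze = [{"name": "Ann", "city": "Oslo"}]

# Bob exists in Silver but not in Bronze, so he is a deletion candidate.
deleted = left_anti_join(silver_active, bronze, on=["name", "city"])
```

One difference to keep in mind: Python treats `None == None` as a match, whereas a SQL equality join never matches on NULL, so nullable columns may need null-safe comparison in the real query.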
Topics
- SCD2 Implementation
- Databricks Delta Lake
- Change Data Feed
- Data Warehousing
- Truncate-Load Strategy
Best for: Data Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.