SQL Boo-Boos #3: Why My Row Count Doubled After a Join (And Why That’s Way More Dangerous Than You…

Source: Data Engineering on Medium · Field: Technology & Digital — Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, medium

Summary

This article addresses a common and dangerous SQL error: row duplication from one-to-many joins, which silently inflates aggregated results without raising an error. Joining a `customers` table (one row per customer) to an `orders` table (many rows per customer) correctly produces one row per order. The "boo-boo" occurs when a second "many" table, such as `customer_scores` (multiple scores per customer), is added to the join. Each of a customer's order rows is then repeated once per score row, so sums over the joined result are inflated: in the article's example, Jane Doe's actual revenue of 700 is reported as 1,400. The article emphasizes that SQL executes instructions, not intentions, and that this multiplication of rows before aggregation is the root cause of the incorrect reports.
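The inflation described above can be reproduced in a few lines. This is a minimal sketch using Python's built-in `sqlite3` module, not the article's own code: the column names, the two order amounts (300 and 400), and the two score values are assumptions chosen only so that the totals match the 700 and 1,400 figures in the summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    CREATE TABLE customer_scores (customer_id INTEGER, score INTEGER);

    -- Hypothetical data: one customer, two orders, two scores.
    INSERT INTO customers VALUES (1, 'Jane Doe');
    INSERT INTO orders VALUES (101, 1, 300), (102, 1, 400);
    INSERT INTO customer_scores VALUES (1, 80), (1, 90);
""")

# One "many" table: two joined rows, and the sum is the true revenue.
correct = conn.execute("""
    SELECT SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
""").fetchone()[0]

# Two "many" tables: 2 orders x 2 scores = 4 joined rows, so every
# order amount is counted twice before SUM runs.
inflated = conn.execute("""
    SELECT SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    JOIN customer_scores s ON s.customer_id = c.customer_id
""").fetchone()[0]

print(correct, inflated)
```

Neither query raises an error or warning; the only visible difference is the number itself.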

Key takeaway

For Data Analysts and Data Scientists building revenue or customer reports, always verify the grain of your tables before joining. If a child table has multiple rows per key, reduce it to a single row per key first, either by aggregating with `GROUP BY` or by selecting one row per key with `ROW_NUMBER()`, to prevent silent data inflation. This keeps aggregates from being multiplied by the join and avoids embarrassing discrepancies with finance.
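One way to verify grain is to ask each table which keys appear more than once. The helper below is a hypothetical sketch (the name `duplicate_keys` and the sample schema are assumptions, not from the article), again using `sqlite3` so it runs as written.

```python
import sqlite3

def duplicate_keys(conn, table, key):
    """Return (key_value, row_count) pairs for keys that occur more than once."""
    # table and key are interpolated into the SQL text, so pass only
    # trusted identifiers here, never user input.
    query = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    )
    return conn.execute(query).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_scores (customer_id INTEGER, score INTEGER);

    -- Hypothetical data: one customer with two score rows.
    INSERT INTO customers VALUES (1, 'Jane Doe');
    INSERT INTO customer_scores VALUES (1, 80), (1, 90);
""")

customers_dupes = duplicate_keys(conn, "customers", "customer_id")
scores_dupes = duplicate_keys(conn, "customer_scores", "customer_id")
print(customers_dupes, scores_dupes)
```

An empty result means the table is already at one row per key; any returned rows mark a table that will fan out the join and should be reduced first.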

Key insights

One-to-many joins can silently inflate aggregated results, leading to critical reporting errors.

Method

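To prevent duplicated rows from inflating a join, reduce each child table (e.g., `orders`, `customer_scores`) to the parent's grain (e.g., one row per `customer_id`) before performing the final join: aggregate with `GROUP BY` when you need a total, or keep a single row per key with `ROW_NUMBER()` when you need one representative record.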
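The method can be sketched as two pre-reduction steps followed by a join at customer grain. This is an illustration under stated assumptions, not the article's query: the `scored_at` column, the rule "keep the most recent score", and all sample values are hypothetical. `ROW_NUMBER()` is a window function, which SQLite supports from version 3.25 onward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    -- scored_at is an assumed column used only to decide which score to keep.
    CREATE TABLE customer_scores (customer_id INTEGER, score INTEGER, scored_at TEXT);

    INSERT INTO customers VALUES (1, 'Jane Doe');
    INSERT INTO orders VALUES (101, 1, 300), (102, 1, 400);
    INSERT INTO customer_scores VALUES (1, 80, '2024-01-01'), (1, 90, '2024-02-01');
""")

rows = conn.execute("""
    WITH order_totals AS (
        -- GROUP BY: collapse orders to one row per customer.
        SELECT customer_id, SUM(amount) AS revenue
        FROM orders
        GROUP BY customer_id
    ),
    latest_score AS (
        -- ROW_NUMBER(): keep only the most recent score per customer.
        SELECT customer_id, score
        FROM (
            SELECT customer_id, score,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer_id ORDER BY scored_at DESC
                   ) AS rn
            FROM customer_scores
        ) ranked
        WHERE rn = 1
    )
    -- Every input is now one row per customer_id, so nothing multiplies.
    SELECT c.name, t.revenue, s.score
    FROM customers c
    JOIN order_totals t ON t.customer_id = c.customer_id
    JOIN latest_score s ON s.customer_id = c.customer_id
""").fetchall()

print(rows)
```

Because both child tables are at customer grain before the final join, the result has one row per customer and revenue stays at its true total.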

Best for: Data Analyst, Data Scientist, Analytics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.