Ch 9 - Counting and Aggregation: Controlling the Grain

2026-03-22 · Source: Practical Data Modeling · Field: Technology & Digital — Data Science & Analytics, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, short

Summary

Chapter 9 of "Practical Data Modeling" focuses on aggregation and counting, emphasizing that these are not merely post-model calculations but fundamental constraints built into the data model itself. The author highlights the pitfalls of relying on averages, citing examples like the U.S. Air Force's "average pilot" cockpit design that fit no one, and the ambiguity of simple counts like "active users" without precise definitions of identity, existence, discreteness, and context. The chapter explains that aggregation compresses detail for simplicity and speed, a trade-off that must reveal signal, not destroy it. It covers how different domains, from machine learning to streaming data, employ various compression techniques, and introduces structural principles for safe aggregation, particularly stressing the importance of identifying and aligning data grain to prevent issues like double counting and ambiguous interpretations.

Key takeaway

Data engineers and data scientists designing data models should treat aggregation as a core structural element, not an afterthought. Define the grain of every dataset explicitly and preserve it throughout the model so that metrics stay trustworthy and reproducible. Give counts and aggregations precise definitions to avoid double counting and ambiguous interpretations.

Key insights

Effective data modeling requires building aggregation constraints directly into the model, not just applying them afterward.

Principles

Aggregation is a model constraint, not just a calculation.
Data must share the same grain for successful aggregation.
Grain changes must be explicit, never accidental.
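
To make the grain principles concrete, here is a minimal sketch in plain Python. The tables, column names, and both function names are hypothetical and do not come from the chapter. The sketch shows how joining order-grain data to item-grain data silently repeats order totals, and how rolling items up to order grain first keeps the sum correct.

```python
from collections import defaultdict

# Hypothetical data. `orders` has one row per order (order grain);
# `order_items` has one row per item within an order (item grain).
orders = [
    {"order_id": 1, "order_total": 100},
    {"order_id": 2, "order_total": 50},
]
order_items = [
    {"order_id": 1, "sku": "A", "qty": 1},
    {"order_id": 1, "sku": "B", "qty": 2},
    {"order_id": 2, "sku": "A", "qty": 1},
]


def naive_revenue(orders, order_items):
    """Join orders to items, then sum order_total.

    The join moves the data to item grain, so an order with two items
    contributes its total twice. The grain change is accidental.
    """
    totals = {o["order_id"]: o["order_total"] for o in orders}
    joined = [
        {**item, "order_total": totals[item["order_id"]]}
        for item in order_items
    ]
    return sum(row["order_total"] for row in joined)


def safe_revenue(orders, order_items):
    """Roll items up to order grain first, then combine at the shared grain.

    Returns (total revenue, total quantity).
    """
    qty_per_order = defaultdict(int)
    for item in order_items:
        qty_per_order[item["order_id"]] += item["qty"]
    rows = [
        {
            "order_id": o["order_id"],
            "order_total": o["order_total"],
            "total_qty": qty_per_order[o["order_id"]],
        }
        for o in orders
    ]
    revenue = sum(r["order_total"] for r in rows)
    quantity = sum(r["total_qty"] for r in rows)
    return revenue, quantity
```

With the sample data, the naive version counts order 1 twice, while the safe version changes grain explicitly before anything is summed.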

Method

Before counting, define what is being counted, establish existence and cardinality, determine discreteness, and account for context and scope to ensure meaningful results.
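
One way to apply this checklist is to encode each decision as an explicit parameter or filter. The sketch below is illustrative only: the function name, the event fields, the 30-day window, and the list of qualifying event types are all assumptions, not definitions from the chapter.

```python
from datetime import date, timedelta


def count_active_users(events, as_of, window_days=30,
                       qualifying=("login", "purchase")):
    """Count distinct users with a qualifying event inside the window.

    Each filter corresponds to one decision in the method:
    - what is counted: distinct user_id values, not event rows
    - existence: events without a user_id are excluded
    - discreteness: a user counts once, however many events they have
    - context and scope: only qualifying event types inside the window
    """
    start = as_of - timedelta(days=window_days)
    active = {
        e["user_id"]
        for e in events
        if e["user_id"] is not None
        and e["event_type"] in qualifying
        and start < e["event_date"] <= as_of
    }
    return len(active)


# Hypothetical sample events.
events = [
    {"user_id": 1, "event_type": "login", "event_date": date(2026, 3, 20)},
    {"user_id": 1, "event_type": "purchase", "event_date": date(2026, 3, 21)},
    {"user_id": 2, "event_type": "page_view", "event_date": date(2026, 3, 21)},
    {"user_id": 3, "event_type": "login", "event_date": date(2026, 1, 1)},
    {"user_id": None, "event_type": "login", "event_date": date(2026, 3, 21)},
    {"user_id": 4, "event_type": "login", "event_date": date(2026, 3, 22)},
]
```

Changing any one parameter, for example adding "page_view" to the qualifying event types, changes the result. That is why the definition needs to be written down and shared rather than left implicit.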

In practice

Define "active user" precisely across teams.
Document data grain for every dataset.
Use COUNT(DISTINCT user_id) when counting users from data stored at a finer grain, such as one row per event.
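
As a small illustration of the last item, the sketch below uses Python's built-in sqlite3 module with a hypothetical events table holding one row per event. The table name, columns, and rows are assumptions made for the example.

```python
import sqlite3

# In-memory database with a hypothetical event-grain table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, event_type) VALUES (?, ?)",
    [
        (1, "login"),
        (1, "purchase"),
        (2, "login"),
        (3, "login"),
        (3, "login"),
    ],
)

# COUNT(*) counts rows at event grain.
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

# COUNT(DISTINCT user_id) moves the count to user grain explicitly.
user_count = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events"
).fetchone()[0]

conn.close()
```

The two queries answer different questions: the first counts events, the second counts users. Naming the grain in the query makes the intended question visible to whoever reads it later.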

Topics

Data Aggregation
Data Counting
Data Grain
Safe Aggregation Principles
Disjoint Grouping

Best for: Data Engineer, Data Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Practical Data Modeling.