Ch 9 - Counting and Aggregation: Controlling the Grain
Summary
Chapter 9 of "Practical Data Modeling" focuses on aggregation and counting, arguing that these are not calculations applied after modeling is finished but fundamental constraints built into the data model itself. The author highlights the pitfalls of relying on averages, citing the U.S. Air Force's "average pilot" cockpit design that fit no one, and shows that a simple count such as "active users" is ambiguous until identity, existence, discreteness, and context are precisely defined. Aggregation compresses detail in exchange for simplicity and speed, and that trade-off must reveal signal, not destroy it. The chapter covers how different domains, from machine learning to streaming data, employ various compression techniques, and introduces structural principles for safe aggregation, stressing that data grain must be identified and aligned to prevent double counting and ambiguous interpretations.
Key takeaway
Data Engineers and Data Scientists designing data models should treat aggregation as a core structural element, not an afterthought. Define data grain explicitly and preserve its integrity throughout the model so that metrics are trustworthy and reproducible. Give every count and aggregation a clear definition to avoid common pitfalls such as double counting and ambiguous interpretations.
Key insights
Effective data modeling requires building aggregation constraints directly into the model, not just applying them afterward.
Principles
- Aggregation is a model constraint, not just a calculation.
- Data must share the same grain for successful aggregation.
- Grain changes must be explicit, never accidental.
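A minimal sketch of how an accidental grain change produces double counting, and how an explicit roll-up avoids it. The tables, column names, and values here are hypothetical and are not taken from the chapter; the example assumes one dataset at order grain and one at order-line grain.

```python
from collections import defaultdict

# Hypothetical data: one row per order (order grain).
orders = [
    {"order_id": 1, "shipping_fee": 5.0},
    {"order_id": 2, "shipping_fee": 7.0},
]

# Hypothetical data: one row per order line (a finer grain).
order_items = [
    {"order_id": 1, "sku": "A", "amount": 10.0},
    {"order_id": 1, "sku": "B", "amount": 20.0},
    {"order_id": 2, "sku": "C", "amount": 15.0},
]

fee_by_order = {o["order_id"]: o["shipping_fee"] for o in orders}

# Accidental grain change: joining order-level fees onto item rows
# repeats each fee once per item, so order 1's fee is counted twice.
joined = [
    {**item, "shipping_fee": fee_by_order[item["order_id"]]}
    for item in order_items
]
naive_fee_total = sum(row["shipping_fee"] for row in joined)

# Explicit grain change: roll items up to one row per order first,
# so both datasets share the order grain before they are combined.
amount_by_order = defaultdict(float)
for item in order_items:
    amount_by_order[item["order_id"]] += item["amount"]

aligned = [
    {
        "order_id": o["order_id"],
        "shipping_fee": o["shipping_fee"],
        "amount": amount_by_order[o["order_id"]],
    }
    for o in orders
]
safe_fee_total = sum(row["shipping_fee"] for row in aligned)
safe_amount_total = sum(row["amount"] for row in aligned)

print(naive_fee_total)    # 17.0 (inflated: 5 + 5 + 7)
print(safe_fee_total)     # 12.0
print(safe_amount_total)  # 45.0
```

The roll-up step is where the grain change becomes visible in the code: anyone reading it can see that item rows were deliberately collapsed to one row per order before any totals were taken.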
Method
Before counting, define what is being counted, establish existence and cardinality, determine discreteness, and account for context and scope to ensure meaningful results.
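One way to make those decisions explicit is to encode each of them as a visible parameter or filter. The sketch below is illustrative only: the event log, the field names, the `count_active_users` function, and the particular definition of "active" are all assumptions, not the chapter's definitions.

```python
from datetime import date, timedelta

# Hypothetical event log: one row per event.
events = [
    {"user_id": "u1", "event_type": "login", "day": date(2024, 1, 30), "is_test_account": False},
    {"user_id": "u1", "event_type": "purchase", "day": date(2024, 1, 31), "is_test_account": False},
    {"user_id": "u2", "event_type": "login", "day": date(2024, 1, 2), "is_test_account": False},
    {"user_id": "u3", "event_type": "login", "day": date(2024, 1, 31), "is_test_account": True},
    {"user_id": "u4", "event_type": "email_open", "day": date(2024, 1, 31), "is_test_account": False},
]


def count_active_users(events, as_of, window_days=7,
                       qualifying_events=("login", "purchase")):
    """Count distinct users under one assumed definition of 'active'."""
    window_start = as_of - timedelta(days=window_days - 1)
    active_ids = {
        e["user_id"]                              # what is counted: a user, identified by user_id
        for e in events
        if not e["is_test_account"]               # existence: only real accounts count
        and e["event_type"] in qualifying_events  # context: which actions qualify
        and window_start <= e["day"] <= as_of     # scope: the time window
    }
    # Discreteness: the set keeps each user once, however many events they have.
    return len(active_ids)


weekly_active = count_active_users(events, as_of=date(2024, 1, 31))
monthly_active = count_active_users(events, as_of=date(2024, 1, 31), window_days=31)

print(weekly_active)   # 1
print(monthly_active)  # 2
```

The same event log yields different counts under different windows, which is why the definition needs to travel with the number.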
In practice
- Define "active user" precisely across teams.
- Document data grain for every dataset.
- Use COUNT(DISTINCT user_id) when counting users from data stored at a finer grain, such as one row per event.
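To illustrate the last practice item, the sketch below compares COUNT(*) with COUNT(DISTINCT user_id) on an event-grain table, using Python's built-in `sqlite3` module with an in-memory database. The `events` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database holding a hypothetical event-grain table:
# one row per event, so a user can appear many times.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("u1", "login"),
        ("u1", "purchase"),
        ("u2", "login"),
        ("u2", "login"),
        ("u3", "login"),
    ],
)

# COUNT(*) counts rows, which answers "how many events?".
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

# COUNT(DISTINCT user_id) moves from event grain to user grain,
# which answers "how many users?".
user_count = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events"
).fetchone()[0]

conn.close()

print(row_count)   # 5
print(user_count)  # 3
```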
Topics
- Data Aggregation
- Data Counting
- Data Grain
- Safe Aggregation Principles
- Disjoint Grouping
Best for: Data Engineer, Data Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Practical Data Modeling.