Ch 5 - Entities, Instances, and Identifiers
Summary
This chapter introduces the fundamental concept of "entities" in data modeling, defining them as concrete or abstract "things" that are meaningful enough to be represented in a data model, such as Customer, Product, or Order. It emphasizes that correctly identifying entities is crucial for building effective data models, and contrasts this with chaotic, error-prone systems like the "Master_Tracker_Final_v7.xlsx" spreadsheet. The chapter explores how entities are discovered and represented across five data paradigms: Relational (tables), Analytics (dimension tables), Application (JSON documents), ML/AI (feature vectors), and Knowledge (type nodes). It also covers entity representation in semi-structured data (JSON objects) and in unstructured data (text, images, video, audio), using techniques such as Named Entity Recognition (NER) for text. Finally, it discusses entities in metadata, shifting the focus from modeling business operations to modeling the data system itself, with entities like PipelineRun or DataQualityScore.
Key takeaway
Data Engineers and Architects designing new systems or refactoring existing ones should prioritize explicit entity identification early in the process. Failing to define clear entities leads to chaotic, unmanageable data structures like monolithic tables or giant JSON blobs. Pay attention to how each entity is represented across the different data paradigms so that representations stay consistent and interoperable, especially when integrating diverse data sources and analytical tools.
Key insights
Distinctly naming and identifying entities is the foundational principle of effective data modeling across all data forms.
Principles
- If you can't name it, you can't model it.
- Entities transform across data paradigms.
- Metadata entities model the data system itself.
Method
Identify entities by examining workflows and listing the nouns (subjects and objects) involved. Then represent each entity according to the target data paradigm (Relational, Analytics, Application, ML/AI, Knowledge) or data form (Semi-structured, Unstructured, Metadata).
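As a rough illustration of the method, the sketch below takes one entity, Customer, and renders the same instance in four of the representations named above: a relational row, a JSON document, a feature vector, and knowledge-graph style triples. The field names, the `to_*` helper functions, and the `is_a` / `has_name` predicates are all hypothetical choices made for this example; they are not taken from the original article.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Customer:
    """One entity, found by asking which nouns a workflow talks about."""
    customer_id: str      # identifier for the instance (hypothetical field)
    name: str             # hypothetical attribute
    lifetime_orders: int  # hypothetical attribute
    is_active: bool       # hypothetical attribute


def to_relational_row(c: Customer) -> tuple:
    # Relational: one row in a customer table, columns in a fixed order.
    return (c.customer_id, c.name, c.lifetime_orders, c.is_active)


def to_json_document(c: Customer) -> str:
    # Application / semi-structured: a self-describing JSON object.
    return json.dumps(asdict(c), sort_keys=True)


def to_feature_vector(c: Customer) -> list:
    # ML/AI: numeric features only; the name carries no numeric signal here.
    return [float(c.lifetime_orders), 1.0 if c.is_active else 0.0]


def to_triples(c: Customer) -> list:
    # Knowledge: (subject, predicate, object) statements about the instance.
    subject = f"customer:{c.customer_id}"
    return [
        (subject, "is_a", "Customer"),
        (subject, "has_name", c.name),
    ]


customer = Customer("C-001", "Ada Lovelace", 3, True)
row = to_relational_row(customer)
document = to_json_document(customer)
features = to_feature_vector(customer)
triples = to_triples(customer)
```

The point of the sketch is that the entity and its identifier stay the same while the shape changes; only `customer_id` lets the row, the document, and the triples be recognized as the same instance.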
In practice
- Use NER to extract entities from unstructured text.
- Model metadata to track data governance and quality.
- Ensure unique identifiers for all entity instances.
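The last two practices can be sketched together: a metadata entity (PipelineRun, named in the summary) whose instances each receive a unique identifier, and a small registry that refuses to store two instances under the same identifier. This is a minimal sketch under stated assumptions: the fields of `PipelineRun`, the `EntityRegistry` class, and the choice of UUIDs as identifiers are illustrative and not taken from the original article.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineRun:
    """Metadata entity: describes the data system, not the business."""
    pipeline_name: str  # hypothetical field
    status: str         # hypothetical field
    # Assumption: a random UUID is an acceptable identifier for a run.
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class EntityRegistry:
    """Hypothetical in-memory store that enforces identifier uniqueness."""

    def __init__(self) -> None:
        self._instances = {}

    def add(self, identifier: str, instance: object) -> None:
        # Reject a second instance that claims an existing identifier.
        if identifier in self._instances:
            raise ValueError(f"duplicate identifier: {identifier}")
        self._instances[identifier] = instance

    def get(self, identifier: str) -> object:
        return self._instances[identifier]

    def __len__(self) -> int:
        return len(self._instances)


registry = EntityRegistry()
first = PipelineRun("daily_orders", "succeeded")
second = PipelineRun("daily_orders", "failed")
registry.add(first.run_id, first)
registry.add(second.run_id, second)
```

Two runs of the same pipeline share a name but remain distinct instances because their identifiers differ; adding `first` a second time raises `ValueError`.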
Topics
- Data Modeling
- Entity Identification
- Mixed Model Architecture
- Data Forms
- Named Entity Recognition
Best for: Data Engineer, MLOps Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Practical Data Modeling.