Ch 5 - Entities, Instances, and Identifiers

· Source: Practical Data Modeling · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, medium

Summary

This chapter introduces entities, the fundamental concept of data modeling: concrete or abstract "things" that are meaningful enough to be represented in a data model, such as Customer, Product, or Order. It argues that correctly identifying entities is crucial for building effective data models, and contrasts this with chaotic, error-prone systems like the "Master_Tracker_Final_v7.xlsx" spreadsheet. The chapter explores how entities are discovered and represented across five data paradigms: Relational (tables), Analytics (dimension tables), Application (JSON documents), ML/AI (feature vectors), and Knowledge (type nodes). It then covers entity representation in semi-structured data (JSON objects) and in unstructured data (text, images, video, audio), where entities are surfaced through techniques like Named Entity Recognition (NER). Finally, it discusses entities in metadata, where the focus shifts from modeling business operations to modeling the data system itself, with entities like PipelineRun or DataQualityScore.
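To make the paradigm comparison concrete, here is a minimal Python sketch of one Customer entity instance projected into four of the representations named above. The field names (customer_id, name, country, lifetime_orders), the helper functions, and the sample values are assumptions for illustration; they do not come from the original chapter.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Customer:
    # Hypothetical attributes; the chapter only names the entity itself.
    customer_id: int
    name: str
    country: str
    lifetime_orders: int


def to_relational_row(c):
    # Relational: one row in a table, columns in a fixed order.
    return (c.customer_id, c.name, c.country, c.lifetime_orders)


def to_json_document(c):
    # Application: a self-describing JSON document.
    return json.dumps(asdict(c), sort_keys=True)


def to_feature_vector(c, countries):
    # ML/AI: numeric features, with country one-hot encoded.
    one_hot = [1.0 if c.country == code else 0.0 for code in countries]
    return [float(c.lifetime_orders)] + one_hot


def to_graph_triples(c):
    # Knowledge: an instance node linked to its type node and attributes.
    node = f"customer:{c.customer_id}"
    return [
        (node, "is_a", "Customer"),
        (node, "has_name", c.name),
        (node, "located_in", c.country),
    ]


customer = Customer(42, "Ada", "GB", 7)
print(to_relational_row(customer))
print(to_json_document(customer))
print(to_feature_vector(customer, ["GB", "US"]))
print(to_graph_triples(customer))
```

The same identifier (customer_id) survives in the row, the document, and the graph node name, which is what lets the representations be reconciled later; the feature vector drops it, so it would need to be carried alongside.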

Key takeaway

Data Engineers and Architects designing new systems or refactoring existing ones should prioritize explicit entity identification early in the process. Without clearly defined entities, data structures become chaotic and unmanageable, taking the form of monolithic tables or giant JSON blobs. Pay attention to how each entity is represented across different data paradigms to ensure consistency and interoperability, especially when integrating diverse data sources and analytical tools.

Key insights

Explicitly naming and identifying entities is the foundational principle of effective data modeling across all forms of data.

Method

Identify entities by examining workflows and noting the nouns (subjects and objects) involved. Then represent each entity according to the data paradigm in use (Relational, Analytics, Application, ML/AI, Knowledge, Semi-structured, Unstructured, Metadata).
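The first step of this method can be sketched in a few lines of Python. The sketch assumes the workflow has already been written down as (subject, verb, object) statements; the statements themselves, including the Invoice and Warehouse nouns, are hypothetical examples rather than content from the chapter. Nouns that recur across statements are the strongest entity candidates.

```python
from collections import Counter

# Hypothetical workflow, one (subject, verb, object) statement per step.
WORKFLOW = [
    ("Customer", "places", "Order"),
    ("Order", "contains", "Product"),
    ("Customer", "pays", "Invoice"),
    ("Warehouse", "ships", "Order"),
]


def candidate_entities(steps):
    """Count how often each noun appears as a subject or an object."""
    counts = Counter()
    for subject, _verb, obj in steps:
        counts[subject] += 1
        counts[obj] += 1
    # Most frequently mentioned nouns first.
    return counts.most_common()


candidates = candidate_entities(WORKFLOW)
for noun, mentions in candidates:
    print(f"{noun}: {mentions}")
```

Frequency is only a starting heuristic. A noun mentioned once may still be a real entity, and a frequent one may turn out to be an attribute of something else, so the resulting list is meant to be reviewed by a person before it is mapped onto any paradigm.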

Best for: Data Engineer, MLOps Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Practical Data Modeling.