The Semantic Medallion: Building a Knowledge Graph-Powered Data Catalog
Summary
Veronika Heimsbakk, a Knowledge Graph specialist at Data Treehouse, proposes an enhancement to the traditional medallion architecture for data lakehouses, transforming the Gold layer into a knowledge graph. This approach addresses the limitations of conventional data catalogs, which struggle to connect disparate data sources semantically. The updated architecture maintains Bronze for raw data and Silver for structured data, but crucially introduces Internationalized Resource Identifiers (IRIs) in the Silver layer to create stable, globally unique identifiers for entities. The Gold layer then harmonizes these Silver DataFrames using a shared ontology and publishes them as RDF, enabling semantic querying and entity resolution across multiple systems. This method leverages the W3C standard Data Catalog Vocabulary (DCAT) to describe datasets and integrate metadata directly into the knowledge graph, facilitating interoperability, rich metadata, and built-in provenance.
Key takeaway
For AI Architects and Data Engineers building data catalogs, consider evolving your medallion architecture's Gold layer into a knowledge graph. By implementing IRIs in the Silver layer and publishing harmonized data as RDF using a shared ontology and DCAT, you can achieve semantic search, entity resolution, and impact analysis capabilities that traditional catalogs lack. This approach keeps metadata and data in the same graph, which improves data discoverability and interoperability across your organization.
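To illustrate what "metadata and data in the same graph" can mean, here is a minimal sketch that stores DCAT dataset descriptions next to an entity and answers a simple impact-analysis question. It uses a plain Python set of triples rather than a real triple store. The `https://example.org/` namespace, the dataset and customer names, and the `inDataset` property are hypothetical and chosen only for illustration; the DCAT, Dublin Core, and PROV terms are the published W3C/DCMI ones.

```python
# Sketch only: a set of (subject, predicate, object) tuples stands in for a knowledge graph.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
PROV = "http://www.w3.org/ns/prov#"
EX = "https://example.org/"                    # hypothetical namespace
IN_DATASET = EX + "ontology/inDataset"         # hypothetical linking property

gold_dataset = EX + "dataset/customers-gold"   # hypothetical dataset IRIs
silver_dataset = EX + "dataset/crm-silver"
customer = EX + "customer/c-001"               # hypothetical entity IRI

graph = {
    # Metadata: DCAT descriptions of the datasets, with provenance.
    (silver_dataset, RDF_TYPE, DCAT + "Dataset"),
    (silver_dataset, DCT + "title", "CRM customers (Silver)"),
    (gold_dataset, RDF_TYPE, DCAT + "Dataset"),
    (gold_dataset, DCT + "title", "Customers (Gold)"),
    (gold_dataset, PROV + "wasDerivedFrom", silver_dataset),
    # Data: an entity living in the same graph as its catalog entry.
    (customer, RDF_TYPE, EX + "ontology/Customer"),
    (customer, IN_DATASET, gold_dataset),
}


def impacted_by(triples, source_dataset):
    """Return datasets derived from source_dataset and the entities they contain."""
    derived = {s for s, p, o in triples
               if p == PROV + "wasDerivedFrom" and o == source_dataset}
    entities = {s for s, p, o in triples
                if p == IN_DATASET and o in derived}
    return derived, entities


derived, entities = impacted_by(graph, silver_dataset)
print(derived, entities)
```

Because catalog entries and entities share one graph, a single traversal goes from a source dataset to the downstream entities it feeds. In a real deployment that traversal would more likely be a SPARQL query.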
Key insights
Transforming the Gold layer of a medallion architecture into a knowledge graph enables semantic data catalogs.
Principles
- IRIs are crucial for cross-system entity identification.
- Ontology design is an iterative process.
- Medallion architecture can be extended for semantic enrichment.
Method
The method has three steps: mint IRIs in the Silver layer, map the Silver DataFrames to a shared ontology, and publish the result as RDF in the Gold layer. Tools such as maplib handle the DataFrame-to-RDF mapping, and standards such as DCAT describe the resulting datasets.
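The three steps can be sketched in plain Python. This is not maplib's API, which is not reproduced here; lists of dictionaries stand in for DataFrames, and the output is N-Triples text. The base namespace, the `customer_no` key, the `Customer` class, and the column-to-property mapping are all assumptions made for the example.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "https://example.org/"      # hypothetical namespace
ONT = BASE + "ontology/"           # hypothetical shared ontology

# Hypothetical mapping from Silver column names to ontology properties.
COLUMN_TO_PROPERTY = {"name": ONT + "name", "country": ONT + "country"}


def mint_iri(entity_type, natural_key):
    """Silver step: derive a stable IRI from a normalized natural key."""
    key = quote(natural_key.strip().lower(), safe="")
    return f"{BASE}{entity_type}/{key}"


def to_silver(rows):
    """Attach an 'iri' field to each row (rows stand in for a DataFrame)."""
    return [{**row, "iri": mint_iri("customer", row["customer_no"])} for row in rows]


def escape_literal(value):
    """Escape the characters N-Triples requires inside a quoted literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def to_gold(silver_rows):
    """Gold step: express each row as N-Triples against the shared ontology."""
    lines = []
    for row in silver_rows:
        subject = f"<{row['iri']}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{ONT}Customer> .")
        for column, prop in COLUMN_TO_PROPERTY.items():
            lines.append(f'{subject} <{prop}> "{escape_literal(row[column])}" .')
    return lines


bronze = [{"customer_no": " C-001 ", "name": "Acme AS", "country": "NO"}]
silver = to_silver(bronze)
gold = to_gold(silver)
print("\n".join(gold))
```

Normalizing the key before minting (trimming whitespace, lowercasing, percent-encoding) is one way to keep the IRI stable when the same record is reloaded with cosmetic differences.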
In practice
- Use maplib for DataFrame-to-RDF transformation.
- Start with two overlapping data sources.
- Prioritize stable IRI creation early in the process.
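The advice to start with two overlapping sources and stable IRIs can be illustrated as follows. This is a small sketch under stated assumptions: the two systems (`crm`, `billing`), their column names, and the organization-number key are hypothetical, and a dictionary keyed by IRI stands in for the knowledge graph. When both sources mint the same IRI from a shared natural key, their facts land on one node without a separate matching step.

```python
from urllib.parse import quote

BASE = "https://example.org/id/"   # hypothetical namespace


def mint_iri(kind, key):
    """Derive the same IRI from the same natural key, whatever the source system."""
    return f"{BASE}{kind}/{quote(key.strip().upper(), safe='')}"


# Two hypothetical Silver sources that overlap on one organization.
crm = [{"org_no": "987654321", "name": "Acme AS"}]
billing = [
    {"OrgNumber": " 987654321", "credit_limit": 50000},
    {"OrgNumber": "123456789", "credit_limit": 1000},
]

graph = {}  # IRI -> properties; stands in for a triple store

for row in crm:
    iri = mint_iri("organization", row["org_no"])
    graph.setdefault(iri, {})["name"] = row["name"]

for row in billing:
    iri = mint_iri("organization", row["OrgNumber"])
    graph.setdefault(iri, {})["creditLimit"] = row["credit_limit"]

# Entities described by both systems are those carrying properties from each.
shared = [iri for iri, props in graph.items()
          if {"name", "creditLimit"} <= props.keys()]
print(shared)
```

The overlap only resolves if both sources normalize the key the same way, which is why IRI design is worth settling before the mapping work begins.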
Topics
- Semantic Medallion Architecture
- Knowledge Graphs
- Data Catalogs
- RDF
- IRIs
Best for: Data Engineer, AI Architect, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Modern Data 101.