Why Every Analytics Engineer Needs to Understand Data Architecture
Summary
This article provides a crash course on six core data architectures, detailing their evolution, strengths, and weaknesses. It begins with relational databases, introduced by Edgar F. Codd in the 1970s, emphasizing their schema-on-write approach for structured data. It then covers relational data warehouses, developed to separate analytical workloads (OLAP) from operational systems (OLTP), discussing the Inmon (top-down) and Kimball (bottom-up) approaches. The article describes data lakes as cheap, schema-on-read storage that initially led to "data swamps" but found utility as staging areas. It introduces data lakehouses, pioneered by Databricks around 2020, which combine data lake flexibility with data warehouse reliability via transactional storage layers like Delta Lake. Finally, it explores data mesh, a sociotechnical shift decentralizing data ownership to domain experts, and event-driven architectures, which enable real-time, loosely coupled system reactions via event brokers like Apache Kafka.
Key takeaway
For Analytics Engineers making daily decisions about data structure, storage, and transformation, understanding these architectural paradigms is crucial. Your choices, from picking a view over a table to deciding where transformation logic lives, collectively form the foundation of the analytics ecosystem. Evaluate whether a centralized data warehouse, a flexible data lakehouse, or a decentralized data mesh best fits your organization's scale and domain expertise to avoid costly inefficiencies.
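The view-versus-table choice mentioned above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for a warehouse, and the names raw_events, v_total, and t_total are hypothetical, not taken from the original article. A view is recomputed on every read, while a table built from a query is a snapshot that goes stale until it is rebuilt.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", [(1, 10.0), (2, 5.0)])

# A view stores only the query; results are computed at read time.
conn.execute("CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM raw_events")

# A table created from a query stores the result as of creation time.
conn.execute("CREATE TABLE t_total AS SELECT SUM(amount) AS total FROM raw_events")

# New data arrives after both objects were created.
conn.execute("INSERT INTO raw_events VALUES (3, 2.5)")

view_total = conn.execute("SELECT total FROM v_total").fetchone()[0]
table_total = conn.execute("SELECT total FROM t_total").fetchone()[0]

print("view total:", view_total)    # reflects the new row
print("table total:", table_total)  # still the snapshot value
```

The trade-off is freshness against read cost: the view is always current but pays for the query on each read, while the table is cheap to read but needs a refresh step.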
Key insights
Effective data architecture is crucial for organizational efficiency. The field has evolved from structured relational databases to decentralized, real-time systems.
Principles
- Schema-on-write ensures data consistency upfront.
- Schema-on-read offers flexibility at load time but shifts the work of validating and interpreting data to whoever reads it.
- Data mesh requires sociotechnical, not just technical, change.
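The first two principles can be contrasted in a short sketch. It assumes SQLite (via Python's built-in sqlite3 module) as the schema-on-write store and a list of JSON lines as the schema-on-read store. The orders table, the sample records, and the helper names try_insert and read_orders are hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Schema-on-write: the table definition rejects bad rows at insert time.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "amount REAL NOT NULL CHECK (amount >= 0))"
)

def try_insert(order_id, amount):
    """Return True if the row satisfies the schema, False if it is rejected."""
    try:
        conn.execute(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
            (order_id, amount),
        )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(1, 19.99)   # valid row
rejected = not try_insert(2, None)  # violates NOT NULL

# Schema-on-read: raw records are stored as-is; structure is imposed by the reader.
raw_lines = [
    '{"order_id": 1, "amount": 19.99}',
    '{"order_id": 2}',                    # missing field, stored anyway
    '{"order_id": 3, "amount": "7.50"}',  # string instead of number
]

def read_orders(lines):
    """Parse raw lines, coercing types and setting aside records that do not fit."""
    valid, skipped = [], []
    for line in lines:
        record = json.loads(line)
        try:
            valid.append((int(record["order_id"]), float(record["amount"])))
        except (KeyError, TypeError, ValueError):
            skipped.append(record)
    return valid, skipped

valid_orders, skipped_orders = read_orders(raw_lines)
print(valid_orders, skipped_orders)
```

In the first half the database enforces consistency once, at write time. In the second half every consumer has to carry logic like read_orders, which is where the flexibility of a data lake turns into extra complexity for readers.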
Method
Data architecture defines where data lives, how it moves, how it is transformed, and who can access it, much as city planning governs how a city is laid out and how traffic flows through it.
In practice
- Use relational databases for OLTP systems requiring fast, consistent operations.
- Implement data warehouses for dedicated analytical workloads.
- Consider data lakehouses for combining raw data storage with structured analysis.
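The difference between the operational and analytical workloads in the list above can be sketched with two query shapes. The example assumes a tiny Kimball-style star schema in SQLite. The names dim_product and fact_sales and the sample rows are hypothetical. A real deployment would keep these workloads in separate systems, so a single in-memory database is used here only to keep the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        quantity   INTEGER NOT NULL,
        revenue    REAL NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?)",
    [(1, "books"), (2, "games"), (3, "books")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 20.0), (2, 2, 1, 50.0), (3, 3, 1, 15.0), (4, 1, 1, 10.0)],
)

# OLTP-style access: fetch one record by key, touching very few rows.
one_sale = conn.execute(
    "SELECT quantity, revenue FROM fact_sales WHERE sale_id = ?", (2,)
).fetchone()

# OLAP-style access: scan and aggregate many rows across a fact and a dimension.
revenue_by_category = conn.execute(
    """
    SELECT p.category, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()

print(one_sale)
print(revenue_by_category)
```

The point lookup is the kind of query an operational system must answer quickly and consistently. The grouped join is the kind of scan-heavy query that motivates moving analytics to a dedicated warehouse.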
Topics
- Data Architecture
- Data Warehousing
- Data Lakehouse
- Data Mesh
- Event-Driven Architecture
Best for: Data Engineer, Analytics Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.