Data Quality Diagnostics

Source: Data Engineering on Medium · Field: Technology & Digital (Data Science & Analytics, Artificial Intelligence & Machine Learning) · Depth: Intermediate, medium

Summary

This article introduces a diagnostic framework for identifying subtle data quality issues that evade standard profiling calls such as pandas' `.describe()`. It emphasizes that problem data often looks "plausible" rather than overtly "dirty," which leads to silent failures in data pipelines and models. The framework is structured around five questions: "Does the data match its contract?" (verifying structure, types, and column names), "Is the absence informative?" (analyzing patterns and correlations in missing data), "Are your identifiers trustworthy?" (detecting exact and partial duplicates, and assessing cardinality), "Do the numbers make domain sense?" (identifying outliers and distributional anomalies using domain context), and "Do the fields tell the same story?" (uncovering logical contradictions and referential integrity issues). The article illustrates each question with minimal `pandas` and `numpy` snippets and stresses that the quality of the question matters more than the sophistication of the tool.

Key takeaway

Data scientists and machine learning engineers building robust data pipelines should move beyond basic data profiling. Actively ask diagnostic questions about your data's contract, missingness patterns, identifier integrity, numerical domain sense, and cross-field consistency. This question-driven approach surfaces silent data quality issues before they compromise model accuracy or downstream analysis, which saves significant debugging time.

Key insights

Subtle data quality issues often manifest as "plausible" data, requiring diagnostic questions beyond basic profiling.
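
To make "plausible" data concrete, here is a minimal sketch on invented data. The `age` column and the idea that 30 is a silently applied default are assumptions for illustration, not taken from the original article. The summary statistics look unremarkable, yet a single value covers a suspicious share of rows.

```python
import pandas as pd

# Hypothetical data: 30 is assumed to be a default written in place of missing ages.
ages = pd.Series([23, 30, 41, 30, 30, 57, 30, 35, 30, 30], name="age")

# Standard profiling: min, mean, and max all fall in a believable range.
summary = ages.describe()

# Diagnostic question: does any single value dominate the column?
shares = ages.value_counts(normalize=True)  # sorted by frequency, descending
top_value = shares.index[0]
top_share = shares.iloc[0]

print(summary[["min", "mean", "max"]])
print(f"Most common value {top_value} covers {top_share:.0%} of rows")
```

Nothing in `summary` flags a problem here. Only the question about value concentration does, and whether a 60% share is actually wrong still depends on domain knowledge.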

Method

A diagnostic approach to data quality asks five questions, covering contract adherence, informative absence, identifier trustworthiness, the domain sense of numbers, and cross-field consistency, and answers them with basic Python tools such as `pandas` and `numpy`.
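
The five questions can be sketched as one small function. This is an illustrative sketch under stated assumptions, not the article's own code: the `orders` table, its column names, the `CONTRACT` mapping, the `diagnose` helper, and the set of known customer ids are all hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical contract: expected columns and a dtype check for each.
CONTRACT = {
    "order_id": ptypes.is_integer_dtype,
    "customer_id": ptypes.is_integer_dtype,
    "amount": ptypes.is_float_dtype,
    "ordered_at": ptypes.is_datetime64_any_dtype,
    "shipped_at": ptypes.is_datetime64_any_dtype,
}


def diagnose(df: pd.DataFrame, known_customers: set) -> dict:
    """Answer the five diagnostic questions for a hypothetical orders table."""
    report = {}

    # 1. Does the data match its contract? (column names and types)
    report["missing_columns"] = sorted(set(CONTRACT) - set(df.columns))
    if report["missing_columns"]:
        return report  # later checks assume every contract column exists
    report["wrong_types"] = [c for c, ok in CONTRACT.items() if not ok(df[c])]

    # 2. Is the absence informative? (rates and co-occurrence of missing values)
    missing = df.isna()
    report["missing_rate"] = missing.mean().to_dict()
    report["co_missing_amount_shipped"] = int(
        (missing["amount"] & missing["shipped_at"]).sum()
    )

    # 3. Are the identifiers trustworthy? (exact, id-level, and partial duplicates)
    report["exact_duplicates"] = int(df.duplicated().sum())
    report["duplicate_ids"] = int(df["order_id"].duplicated().sum())
    report["partial_duplicates"] = int(
        df.duplicated(subset=["customer_id", "amount", "ordered_at"]).sum()
    )

    # 4. Do the numbers make domain sense? (assumed rule: amounts are never negative)
    report["negative_amounts"] = int((df["amount"] < 0).sum())

    # 5. Do the fields tell the same story? (logical order and referential integrity)
    report["shipped_before_ordered"] = int(
        (df["shipped_at"] < df["ordered_at"]).sum()
    )
    report["orphan_customers"] = int((~df["customer_id"].isin(known_customers)).sum())
    return report


# Invented example data with one planted issue per question.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 4],
    "customer_id": [10, 11, 11, 12, 13],
    "amount": [25.0, 40.0, 40.0, -5.0, np.nan],
    "ordered_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"]
    ),
    "shipped_at": pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-01", None]
    ),
})

report = diagnose(orders, known_customers={10, 11, 12})
for check, result in report.items():
    print(f"{check}: {result}")
```

Each check is a one-liner on purpose: the thresholds and rules (no negative amounts, shipping after ordering) encode domain knowledge that a generic profiler cannot supply.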

Best for: Data Scientist, Data Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.