Data Quality Diagnostics
Summary
This article introduces a diagnostic framework for identifying subtle data quality issues that evade standard profiling tools like `.describe()`. It emphasizes that data often appears "plausible" rather than overtly "dirty," leading to silent failures in data pipelines and models. The framework is structured around five key questions: "Does the data match its contract?" (verifying structure, types, and column names), "Is the absence informative?" (analyzing patterns and correlations of missing data), "Are your identifiers trustworthy?" (detecting exact and partial duplicates, and assessing cardinality), "Do the numbers make domain sense?" (identifying outliers and distributional anomalies with domain context), and "Do the fields tell the same story?" (uncovering logical contradictions and referential integrity issues). The article provides minimal `pandas` and `numpy` code snippets to illustrate each diagnostic question, stressing that the quality of the question is more important than the sophistication of the tool.
Key takeaway
Data scientists and machine learning engineers building data pipelines should move beyond basic data profiling and actively ask diagnostic questions about their data's contract, missingness patterns, identifier integrity, numerical domain sense, and cross-field consistency. This question-driven approach surfaces silent data quality issues before they compromise model accuracy or downstream analysis, which cuts debugging time later.
Key insights
Subtle data quality issues often manifest as "plausible" data, requiring diagnostic questions beyond basic profiling.
Principles
- Data quality is a continuous monitoring process, not a one-time check.
- The meaning of data anomalies depends on domain knowledge.
- Assertions are hypotheses about data, not just safeguards.
Method
A diagnostic approach to data quality asks five questions, covering contract adherence, informative absence, identifier trustworthiness, the domain sense of numbers, and cross-field consistency, and answers each with basic Python tools such as `pandas` and `numpy`.
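The five questions can be sketched as one small check each. This is a minimal illustration under stated assumptions, not the original article's code: the `orders` and `customers` frames, their column names, and the expected dtypes are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; every column name and value is an illustrative assumption.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4, 5],
    "customer_id": [10, 11, 11, 99, 12],
    "amount": ["19.99", "5.00", "5.00", "-3.50", None],
    "ordered_at": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-05", "2024-01-06"]
    ),
    "shipped_at": pd.to_datetime(
        ["2024-01-03", "2024-01-04", "2024-01-04", "2024-01-01", None]
    ),
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

# 1. Does the data match its contract? Compare actual dtypes to an assumed contract.
expected_dtypes = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}
contract_violations = {
    col: str(orders[col].dtype)
    for col, want in expected_dtypes.items()
    if str(orders[col].dtype) != want
}

# 2. Is the absence informative? Per-column missing rate, plus co-occurring gaps.
missing_rate = orders.isnull().mean()
both_missing = int((orders["amount"].isnull() & orders["shipped_at"].isnull()).sum())

# 3. Are the identifiers trustworthy? Uniqueness of the key and exact duplicate rows.
id_is_unique = orders["order_id"].is_unique
exact_duplicates = int(orders.duplicated().sum())

# 4. Do the numbers make domain sense? Here the assumed rule is "amounts are not negative".
amount = pd.to_numeric(orders["amount"], errors="coerce")
negative_amounts = int((amount < 0).sum())

# 5. Do the fields tell the same story? Temporal contradictions and orphaned foreign keys.
shipped_before_ordered = int((orders["shipped_at"] < orders["ordered_at"]).sum())
orphan_customers = int((~orders["customer_id"].isin(customers["customer_id"])).sum())

print(contract_violations, both_missing, id_is_unique, exact_duplicates,
      negative_amounts, shipped_before_ordered, orphan_customers)
```

Each check returns a count or a flag rather than raising, so the results can be read together before deciding which anomalies are genuine errors and which are explained by the domain.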
In practice
- Use `df.select_dtypes(include="object").apply(lambda col: pd.to_numeric(col, errors="coerce").notna().sum())` to find numeric strings.
- Visualize missingness patterns by inspecting or plotting the transposed null mask `df.isnull().T` to reveal structural issues.
- Assert `df["id"].is_unique` to explicitly test identifier uniqueness.
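The three snippets above can be tried on a small frame. The sketch below is illustrative only: the `df` contents and column names are hypothetical, and the text columns are created with `dtype=object` explicitly so that `select_dtypes(include="object")` picks them up regardless of the pandas version's default string type.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "price": pd.Series(["10.5", "n/a", "7", "7"], dtype=object),
    "city": pd.Series(["Oslo", None, "Lima", "Lima"], dtype=object),
})

# Count, per object column, how many values would parse as numbers.
# A high count suggests a numeric column that was loaded as text.
numeric_like = df.select_dtypes(include="object").apply(
    lambda col: pd.to_numeric(col, errors="coerce").notna().sum()
)

# Transposed null mask: one row per column, one cell per record.
# Note that the sentinel string "n/a" in price is not counted as missing.
missing_mask = df.isnull().T

# Test the uniqueness hypothesis without raising, then list the repeated ids.
# In a pipeline this could instead be: assert df["id"].is_unique
id_is_unique = df["id"].is_unique
repeated_ids = df.loc[df["id"].duplicated(keep=False), "id"].unique().tolist()

print(numeric_like)
print(missing_mask)
print(id_is_unique, repeated_ids)
```

Passing `missing_mask` to a heatmap function of your choice turns it into the visual form; printed as-is, it already shows which columns are missing for which records.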
Topics
- Data Quality
- Data Diagnostics
- Data Validation
- Missing Data Analysis
- Data Profiling
Best for: Data Scientist, Data Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Engineering on Medium.