I Built a Data Quality Tool Because I Was Tired of Debugging Bad Data

2026-05-31 · Source: Data Science on Medium · Field: Technology & Digital — Data Science & Analytics, Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

DataProfi is an open-source, deterministic, rule-based data quality tool designed to quickly identify and explain issues in tabular datasets, particularly for machine learning pipelines and database design. Developed to address common problems like silent data format changes causing model degradation, DataProfi scores data quality across five dimensions—completeness, consistency, uniqueness, validity, and timeliness—aligned with ISO 25012 and DAMA DMBOK frameworks. It classifies columns by semantic role to provide context-aware issue flagging, such as distinguishing critical nulls in ID columns from expected nulls in free-text fields. The tool also features a schema recommender that generates production-ready PostgreSQL "CREATE TABLE" statements from CSVs, mapping pandas dtypes to optimal SQL types and adding constraints. Additionally, DataProfi offers a web dashboard with eleven analysis views, including a cleaning pipeline with before/after previews. It is available under the Apache-2.0 license.

Key takeaway

For ML engineers and data scientists struggling with data quality issues in production, DataProfi offers a rapid, opinionated way to identify and understand data problems before they reach a model. Consider adding this open-source tool to your initial data exploration workflow to assess dataset readiness and generate database schemas. Because every detected issue comes with context, this approach can cut debugging time and help catch silent model degradation early.

Key insights

Fast, opinionated first-contact profiling for tabular data prevents downstream ML issues.

Principles

Data quality tools should explain "why," not just "what."
Context-aware scoring improves issue relevance.
Deterministic rules ensure trust and auditability.

Method

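DataProfi analyzes DataFrames by classifying each column's semantic role, then scores the data across five dimensions (completeness, consistency, uniqueness, validity, and timeliness) aligned with ISO 25012 and DAMA DMBOK, surfacing issues with contextual explanations and suggesting fixes.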
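
To make the idea of role-aware scoring concrete, here is a minimal sketch of how a deterministic rule might treat nulls differently depending on a column's role. It is not DataProfi's API: the role names, the 30-character free-text threshold, the severity labels, and the functions classify_role and completeness_issues are all assumptions made for illustration, and a plain dict of lists stands in for a DataFrame so the example needs only the standard library.

```python
from dataclasses import dataclass

# Hypothetical role names, invented for this sketch.
ID, FREE_TEXT, NUMERIC, CATEGORICAL = "id", "free_text", "numeric", "categorical"


@dataclass
class Issue:
    column: str
    severity: str
    explanation: str


def classify_role(name, values):
    """Guess a semantic role from the column name and its non-null values."""
    present = [v for v in values if v is not None]
    lowered = name.lower()
    if lowered == "id" or lowered.endswith("_id"):
        return ID
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return NUMERIC
    if present and all(isinstance(v, str) for v in present):
        average_length = sum(len(v) for v in present) / len(present)
        if average_length > 30:  # assumed threshold for "free text"
            return FREE_TEXT
    return CATEGORICAL


def completeness_issues(table):
    """Flag nulls per column, with severity and explanation driven by role."""
    issues = []
    for name, values in table.items():
        if not values:
            continue
        null_ratio = sum(v is None for v in values) / len(values)
        if null_ratio == 0:
            continue
        role = classify_role(name, values)
        if role == ID:
            severity = "critical"
            why = "it looks like an identifier, so affected rows cannot be joined or deduplicated"
        elif role == FREE_TEXT:
            severity = "info"
            why = "it looks like optional free text, where gaps are usually expected"
        else:
            severity = "warning"
            why = "missing values here may bias features derived from this column"
        issues.append(
            Issue(name, severity, f"{null_ratio:.0%} of '{name}' is null; {why}.")
        )
    return issues


table = {
    "customer_id": [1, 2, None, 4],
    "notes": [
        "Called twice about the delayed shipment last week",
        None,
        None,
        "Asked for an invoice copy to be sent by email",
    ],
    "age": [34, None, 51, 29],
}

for issue in completeness_issues(table):
    print(f"[{issue.severity}] {issue.explanation}")
```

The same rules always produce the same findings for the same input, which is what makes a rule-based check auditable; the explanation string is where the "why" travels alongside the "what".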

In practice

Profile new datasets before model training.
Generate PostgreSQL DDL from CSVs.
Use cleaning pipelines with before/after previews.
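
As a rough illustration of the second item above, the sketch below derives a PostgreSQL CREATE TABLE statement from CSV text. It is an assumption-laden stand-in, not DataProfi's schema recommender: the summary says the tool maps pandas dtypes, whereas this version infers types directly from the raw strings so it runs with the standard library alone. The function names and the type rules (INTEGER, BIGINT, DOUBLE PRECISION, TEXT, plus NOT NULL when no value is empty) are hypothetical simplifications.

```python
import csv
import io

INT32_MAX = 2_147_483_647


def _parses_as(caster, text):
    """Return True if caster(text) succeeds, e.g. int("42")."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def infer_sql_type(values):
    """Pick a PostgreSQL type from the non-empty string values of one column."""
    present = [v for v in values if v != ""]
    if not present:
        return "TEXT"
    if all(_parses_as(int, v) for v in present):
        fits_int32 = all(abs(int(v)) <= INT32_MAX for v in present)
        return "INTEGER" if fits_int32 else "BIGINT"
    if all(_parses_as(float, v) for v in present):
        return "DOUBLE PRECISION"
    return "TEXT"


def quote_ident(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table_name, csv_text):
    """Build a CREATE TABLE statement from a CSV with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    column_defs = []
    for name in reader.fieldnames or []:
        # Treat cells missing from short rows as empty strings.
        values = [row[name] or "" for row in rows]
        parts = [quote_ident(name), infer_sql_type(values)]
        if rows and all(v != "" for v in values):
            parts.append("NOT NULL")
        column_defs.append("    " + " ".join(parts))
    body = ",\n".join(column_defs)
    return f"CREATE TABLE {quote_ident(table_name)} (\n{body}\n);"


csv_text = "id,price,comment\n1,9.99,ok\n2,,\n3,12.5,late delivery\n"
ddl = create_table_sql("orders", csv_text)
print(ddl)
```

A real recommender would need more care than this, for example with dates, booleans, keys, and values such as "nan" that Python's float accepts, so treat the output as a starting point to review rather than a finished schema.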

Topics

Data Quality
ML Pipelines
Data Profiling
PostgreSQL Schema Generation
Tabular Data
Apache-2.0

Code references

AndreaEr/dataprofi

Best for: MLOps Engineer, AI Engineer, Machine Learning Engineer, Data Scientist, Data Analyst

Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.