I Built a Data Quality Tool Because I Was Tired of Debugging Bad Data
Summary
DataProfi is an open-source, deterministic, rule-based data quality tool that identifies and explains issues in tabular datasets, particularly for machine learning pipelines and database design. It was developed to address common problems such as silent data format changes that degrade models. DataProfi scores data quality across five dimensions (completeness, consistency, uniqueness, validity, and timeliness) aligned with the ISO 25012 and DAMA DMBOK frameworks. It classifies columns by semantic role so that issue flagging is context-aware: for example, nulls in an ID column are treated as critical, while nulls in a free-text field are expected. The tool also includes a schema recommender that generates production-ready PostgreSQL "CREATE TABLE" statements from CSVs, mapping pandas dtypes to suitable SQL types and adding constraints. A web dashboard offers eleven analysis views, including a cleaning pipeline with before/after previews. DataProfi is available under the Apache-2.0 license.
Key takeaway
For ML engineers and data scientists struggling with data quality issues in production, DataProfi offers a fast, opinionated way to identify and understand data problems before they reach a model. Consider adding this open-source tool to your initial data exploration workflow to assess dataset readiness and generate database schemas. Because every detected issue comes with context, this approach can reduce debugging time and help catch the causes of silent model degradation early.
Key insights
Fast, opinionated first-contact profiling for tabular data prevents downstream ML issues.
Principles
- Data quality tools should explain "why," not just "what."
- Context-aware scoring improves issue relevance.
- Deterministic rules ensure trust and auditability.
Method
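DataProfi analyzes pandas DataFrames by first classifying each column's semantic role, then scoring the data across five dimensions aligned with ISO 25012 and DAMA DMBOK (completeness, consistency, uniqueness, validity, and timeliness). It surfaces issues with contextual explanations and suggests fixes.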
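To make the idea of role-aware scoring concrete, here is a minimal sketch in pandas. It is not DataProfi's API: the function names `classify_role` and `profile`, the three role labels, and the severity wording are all hypothetical, and only two of the five dimensions (completeness and uniqueness) are covered.

```python
import pandas as pd


def classify_role(name: str, series: pd.Series) -> str:
    """Guess a column's semantic role (hypothetical heuristic, not DataProfi's)."""
    lowered = str(name).lower()
    if lowered == "id" or lowered.endswith("_id"):
        return "identifier"
    if pd.api.types.is_numeric_dtype(series):
        return "numeric"
    return "free_text"


def profile(df: pd.DataFrame) -> list[dict]:
    """Score each column and attach role-dependent explanations."""
    report = []
    for name in df.columns:
        series = df[name]
        role = classify_role(name, series)

        # Completeness: share of non-null values in the column.
        completeness = float(series.notna().mean()) if len(series) else 1.0

        # Uniqueness: distinct non-null values over all non-null values.
        non_null = series.dropna()
        uniqueness = float(non_null.nunique() / len(non_null)) if len(non_null) else 1.0

        # The same raw finding gets a different severity depending on the role.
        issues = []
        if role == "identifier":
            if completeness < 1.0:
                issues.append("critical: identifier column contains nulls")
            if uniqueness < 1.0:
                issues.append("critical: identifier column contains duplicates")
        elif role == "numeric" and completeness < 1.0:
            issues.append("warning: numeric column contains nulls")
        elif completeness < 1.0:
            issues.append("info: nulls in a free-text column are usually expected")

        report.append({
            "column": name,
            "role": role,
            "completeness": completeness,
            "uniqueness": uniqueness,
            "issues": issues,
        })
    return report


if __name__ == "__main__":
    sample = pd.DataFrame({
        "user_id": [1, 2, 2, None],
        "comment": ["ok", None, "fine", None],
    })
    for row in profile(sample):
        print(row["column"], row["role"], row["issues"])
```

Because every rule is a plain comparison with no randomness or learned model, the same input always yields the same report, which is the property the deterministic, auditable principle above relies on.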
In practice
- Profile new datasets before model training.
- Generate PostgreSQL DDL from CSVs.
- Use cleaning pipelines with before/after previews.
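The CSV-to-DDL step listed above can be sketched as follows. This is an illustrative approximation under stated assumptions, not DataProfi's schema recommender: the helpers `sql_type` and `create_table_sql` are hypothetical names, the dtype-to-type mapping is a conservative default, and the only constraints inferred are NOT NULL and a primary key on a unique, fully populated column named `id`.

```python
import io

import pandas as pd
from pandas.api import types as ptypes


def sql_type(series: pd.Series) -> str:
    """Map a pandas dtype to a PostgreSQL type (assumed, conservative mapping)."""
    if ptypes.is_bool_dtype(series):
        return "BOOLEAN"
    if ptypes.is_integer_dtype(series):
        return "BIGINT"
    if ptypes.is_float_dtype(series):
        return "DOUBLE PRECISION"
    if ptypes.is_datetime64_any_dtype(series):
        return "TIMESTAMP"
    return "TEXT"  # fallback for strings and mixed content


def quote_ident(name) -> str:
    """Double-quote an identifier, doubling any embedded quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def create_table_sql(df: pd.DataFrame, table: str) -> str:
    """Build a CREATE TABLE statement with simple inferred constraints."""
    column_defs = []
    for name in df.columns:
        series = df[name]
        parts = [quote_ident(name), sql_type(series)]
        has_no_nulls = bool(series.notna().all())
        if str(name).lower() == "id" and has_no_nulls and series.is_unique:
            parts.append("PRIMARY KEY")  # implies NOT NULL in PostgreSQL
        elif has_no_nulls:
            parts.append("NOT NULL")
        column_defs.append("    " + " ".join(parts))
    body = ",\n".join(column_defs)
    return f"CREATE TABLE {quote_ident(table)} (\n{body}\n);"


if __name__ == "__main__":
    csv_text = "id,email,score,active\n1,a@example.com,0.5,True\n2,,0.9,False\n"
    frame = pd.read_csv(io.StringIO(csv_text))
    print(create_table_sql(frame, "users"))
```

Constraints inferred from a sample only describe that sample, so a generated statement like this is best treated as a starting point to review before it is applied to a production database.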
Topics
- Data Quality
- ML Pipelines
- Data Profiling
- PostgreSQL Schema Generation
- Tabular Data
- Apache-2.0
Best for: MLOps Engineer, AI Engineer, Machine Learning Engineer, Data Scientist, Data Analyst
Editorial summary, takeaway, and curation by AIssential. Original article published by Data Science on Medium.