Finding Duplicates in Tabular Data with Jupyter and Prodigy
Summary
This article describes a pragmatic approach to data deduplication using the recordlinkage Python library (the Python Record Linkage Toolkit) and the Prodigy annotation tool. It addresses the common problem of "near duplicates" in tabular data: rows that refer to the same entity but differ because of typos or inconsistencies. Comparing every possible pair of rows scales quadratically; 5,000 rows already yield 12,497,500 pairs. The recordlinkage library shrinks this pool through "blocking", which only pairs rows that share a value in a chosen column (e.g., given_name), leaving 55,000 candidate pairs. Comparison rules (e.g., string similarity for surname and address, exact match for state) then narrow these to 822 pairs for human review. The article goes on to customize Prodigy with a Python "recipe" that creates an interactive labeling interface. The interface renders each candidate pair as an HTML table and lets the annotator classify it as "Duplicate", "Unique", or "Double Check". Features such as choice_auto_accept and visual highlighting of differences make labeling faster.
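The pair-count arithmetic and the idea behind blocking can be sketched with the standard library alone. This is an illustrative stand-in, not the recordlinkage API: the toy records, the id field, and the block_pairs helper are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# All unordered pairs among n rows: n * (n - 1) / 2.
print(comb(5000, 2))  # 12497500

# Toy records (hypothetical data); given_name is the blocking column.
records = [
    {"id": 0, "given_name": "anna", "surname": "smith"},
    {"id": 1, "given_name": "anna", "surname": "smyth"},
    {"id": 2, "given_name": "ben", "surname": "jones"},
    {"id": 3, "given_name": "ben", "surname": "jonas"},
    {"id": 4, "given_name": "cara", "surname": "lee"},
]


def block_pairs(rows, key):
    """Return pairs of row ids whose rows share the same value for `key`."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row["id"])
    pairs = []
    for ids in blocks.values():
        # Only rows inside the same block are ever compared.
        pairs.extend(combinations(ids, 2))
    return pairs


candidates = block_pairs(records, "given_name")
print(comb(len(records), 2), "possible pairs ->", len(candidates), "candidates")
```

The trade-off is that blocking never pairs two rows whose blocking value itself differs, so a typo in given_name would hide that duplicate from review.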
Key takeaway
Data scientists and ML engineers tackling complex data deduplication should consider bringing human-in-the-loop annotation in early. Use a library like recordlinkage to reduce millions of potential duplicate pairs to a manageable subset, then customize an annotation tool like Prodigy with Python "recipes" to create an efficient, visually aided labeling experience. This catches "near duplicates" that rule-based systems might miss and lets you iteratively improve the labeling workflow.
Key insights
Effective deduplication of "near duplicates" requires a human in the loop, supported by tools that reduce the number of candidate pairs and tailor the labeling experience.
Principles
- Human-in-the-loop is crucial for complex deduplication.
- Explore the data domain before designing a solution.
- Iterate on the labeling experience for quality and speed.
Method
Use recordlinkage's blocking and comparison rules to reduce the candidate pairs. Then customize Prodigy with a Python recipe that renders these pairs in a web interface for efficient labeling, and keep iterating on the labeling experience.
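A minimal sketch of comparison rules on a candidate pair is shown here, assuming difflib's ratio as a stand-in similarity metric rather than whatever measure the article configured. The 0.85 threshold, the MIN_RULES cut-off, and the sample records are hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], using difflib's ratio as a stand-in metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(left, right, threshold=0.85):
    """Count how many of the three comparison rules a pair satisfies."""
    return sum([
        similarity(left["surname"], right["surname"]) >= threshold,
        similarity(left["address"], right["address"]) >= threshold,
        left["state"] == right["state"],  # exact match rule
    ])


a = {"surname": "Clarke", "address": "12 Station Road", "state": "nsw"}
b = {"surname": "Clark", "address": "12 Station Rd", "state": "nsw"}
c = {"surname": "Nguyen", "address": "4 Hill Street", "state": "vic"}

MIN_RULES = 2  # hypothetical cut-off: keep pairs that satisfy at least two rules
to_review = [pair for pair in [(a, b), (a, c)] if score_pair(*pair) >= MIN_RULES]
print(len(to_review), "pair(s) kept for human review")
```

Only pairs that pass enough rules go to the annotator, which is how a large candidate pool can shrink to a few hundred pairs worth a human's attention.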
In practice
- Use recordlinkage for initial candidate reduction.
- Customize Prodigy recipes for specific labeling tasks.
- Highlight differences visually in labeling interfaces.
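As a rough illustration of the last two points, the sketch below turns a candidate pair into a task dictionary with an HTML table and three answer options, marking the characters that differ. The highlight and pair_to_task helpers and the option ids are hypothetical; wiring the tasks into an actual Prodigy recipe (stream, choice-style interface, choice_auto_accept) is not shown and would need to follow Prodigy's documentation.

```python
from difflib import SequenceMatcher
from html import escape


def highlight(value, other):
    """Escape `value` and wrap the parts that differ from `other` in <mark>."""
    out = []
    matcher = SequenceMatcher(None, value, other)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        chunk = escape(value[i1:i2])
        if not chunk:
            continue  # text present only in `other`: nothing to mark here
        out.append(chunk if tag == "equal" else f"<mark>{chunk}</mark>")
    return "".join(out)


def pair_to_task(left, right, fields):
    """Build a task dict with one table row per field and three options."""
    rows = "".join(
        f"<tr><th>{escape(name)}</th>"
        f"<td>{highlight(left[name], right[name])}</td>"
        f"<td>{highlight(right[name], left[name])}</td></tr>"
        for name in fields
    )
    return {
        "html": f"<table>{rows}</table>",
        "options": [
            {"id": "duplicate", "text": "Duplicate"},
            {"id": "unique", "text": "Unique"},
            {"id": "double_check", "text": "Double Check"},
        ],
    }


task = pair_to_task(
    {"surname": "Clarke", "state": "nsw"},
    {"surname": "Clark", "state": "nsw"},
    fields=["surname", "state"],
)
print(task["html"])
```

Marking only the differing characters lets the annotator spot at a glance why two rows were paired, instead of reading both rows in full.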
Topics
- Data Deduplication
- Human-in-the-Loop AI
- Prodigy
- Record Linkage
- Data Quality Management
- Custom Annotation Recipes
Best for: Machine Learning Engineer, Data Scientist, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.