Finding Duplicates in Tabular Data with Jupyter and Prodigy

2023-04-12 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, extended

Summary

This article describes a pragmatic approach to data deduplication using the recordlinkage Python library and the Prodigy annotation tool. It addresses the common problem of "near duplicates" in tabular data, which arise from typos or inconsistent data entry. Comparing every possible pair of rows quickly becomes expensive: 5,000 rows yield 12,497,500 pairs (5,000 × 4,999 / 2). The recordlinkage library shrinks this pool through "blocking" (e.g., only comparing rows that share a given_name), which cuts the 12,497,500 possible pairs to 55,000 candidate pairs. Comparison rules (e.g., string similarity for surname and address, exact match for state) reduce the candidates further, to 822 pairs for human review. The article then shows how to customize Prodigy with a Python "recipe" to create an interactive labeling interface. The interface renders each candidate pair as an HTML table and lets the annotator classify it as "Duplicate ✅," "Unique ❌," or "Double Check 🧐." Settings such as choice_auto_accept and visual highlighting of the differences between the two records speed up labeling.
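
The pair arithmetic and the two reduction steps can be sketched with the standard library alone. This is a minimal illustration, not the recordlinkage API: the toy records, the helper names, and the 0.8 similarity threshold are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from math import comb

# Hypothetical toy records; the field names mirror those mentioned in the summary.
records = [
    {"id": 0, "given_name": "anna", "surname": "miller", "state": "nsw"},
    {"id": 1, "given_name": "anna", "surname": "millar", "state": "nsw"},
    {"id": 2, "given_name": "anna", "surname": "nguyen", "state": "vic"},
    {"id": 3, "given_name": "ben", "surname": "miller", "state": "nsw"},
]


def all_pairs_count(n):
    """Number of unordered pairs among n rows: n * (n - 1) / 2."""
    return comb(n, 2)


def block_pairs(rows, key):
    """Yield only the pairs of rows that share the same value for `key`."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row)
    for group in blocks.values():
        yield from combinations(group, 2)


def similar(a, b, threshold=0.8):
    """Fuzzy string match; the threshold is an assumed value."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def is_candidate(a, b):
    """Comparison rule: similar surname AND identical state."""
    return similar(a["surname"], b["surname"]) and a["state"] == b["state"]


# Step 1: blocking on given_name; step 2: comparison rules on what remains.
blocked = list(block_pairs(records, "given_name"))
candidates = [(a["id"], b["id"]) for a, b in blocked if is_candidate(a, b)]

print(all_pairs_count(5000))  # 12497500
print(len(blocked), candidates)
```

Blocking trades recall for speed: two records for the same person with differently spelled given names never land in the same block, so the choice of blocking key matters.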

Key takeaway

Data scientists and ML engineers tackling complex data deduplication should consider bringing human-in-the-loop annotation in early. Use a library like recordlinkage to reduce millions of potential duplicate pairs to a manageable subset. Then customize an annotation tool like Prodigy with a Python "recipe" to create an efficient, visually aided labeling experience. This catches "near duplicates" that rule-based systems might miss, which improves data quality, and it lets you keep refining the labeling workflow as you go.

Key insights

Effective deduplication of "near duplicates" requires a human in the loop, supported by tools that reduce the number of candidate pairs and tailor the labeling experience.

Principles

Human-in-the-loop is crucial for complex deduplication.
Explore data domain before solution design.
Iterate on labeling experience for quality and speed.

Method

Use the recordlinkage library to apply blocking and comparison rules that reduce the number of candidate pairs. Customize Prodigy with a Python recipe that renders these pairs in a web interface for efficient labeling, and iterate on the labeling experience.
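
To make the recipe step more concrete, here is a hedged sketch of how candidate pairs might be turned into annotation tasks. It does not import Prodigy. The helper names and option ids are hypothetical, and the dictionary returned by recipe_components only mirrors the general shape a Prodigy recipe returns (dataset, stream, view_id, config), with the "choice" view assumed as the interface. The original article's recipe may differ.

```python
from html import escape

# Labels come from the summary; the option ids are hypothetical.
OPTIONS = [
    {"id": "duplicate", "text": "Duplicate ✅"},
    {"id": "unique", "text": "Unique ❌"},
    {"id": "double_check", "text": "Double Check 🧐"},
]


def pair_to_html(left, right, fields):
    """Render two records as an HTML table: one row per field, one column per record."""
    rows = "".join(
        f"<tr><th>{escape(name)}</th>"
        f"<td>{escape(str(left[name]))}</td>"
        f"<td>{escape(str(right[name]))}</td></tr>"
        for name in fields
    )
    return f"<table>{rows}</table>"


def make_task(left, right, fields):
    """Build one multiple-choice task for a candidate pair."""
    return {"html": pair_to_html(left, right, fields), "options": OPTIONS}


def recipe_components(dataset, pairs, fields):
    """Assumed shape of the components a custom recipe function would return."""
    return {
        "dataset": dataset,
        "stream": (make_task(a, b, fields) for a, b in pairs),
        "view_id": "choice",
        # choice_auto_accept submits the task as soon as an option is selected.
        "config": {"choice_style": "single", "choice_auto_accept": True},
    }
```

Escaping every cell value keeps stray characters such as `&` or `<` in the raw data from breaking the rendered table.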

In practice

Use recordlinkage for initial candidate reduction.
Customize Prodigy recipes for specific labeling tasks.
Highlight differences visually in labeling interfaces.
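
One way to highlight differences between two field values is a character-level diff. The sketch below uses difflib from the standard library and wraps differing runs in `<mark>` tags. This is an assumed technique for illustration; the summary does not say how the original article implements its highlighting.

```python
from difflib import SequenceMatcher
from html import escape


def highlight_diff(a, b):
    """Return (a_html, b_html) with the differing character runs wrapped in <mark>."""
    out_a, out_b = [], []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        seg_a, seg_b = escape(a[i1:i2]), escape(b[j1:j2])
        if op == "equal":
            out_a.append(seg_a)
            out_b.append(seg_b)
        else:
            # replace / delete / insert: mark whichever side has text.
            if seg_a:
                out_a.append(f"<mark>{seg_a}</mark>")
            if seg_b:
                out_b.append(f"<mark>{seg_b}</mark>")
    return "".join(out_a), "".join(out_b)


print(highlight_diff("miller", "millar"))
# ('mill<mark>e</mark>r', 'mill<mark>a</mark>r')
```

The returned strings can be placed directly into the cells of an HTML table so the annotator's eye goes straight to the characters that differ.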

Topics

Data Deduplication
Human-in-the-Loop AI
Prodigy
Record Linkage
Data Quality Management
Custom Annotation Recipes

Best for: Machine Learning Engineer, Data Scientist, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.