Finding Bad Labels for Text Classification with Jupyter and Prodigy

Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Intermediate, extended

Summary

This walkthrough demonstrates techniques for finding mislabeled examples in text classification training data, using Google's GoEmotions dataset of Reddit comments. Focusing on the "excitement" label, it starts with a heuristic: searching for the substring "excit" in texts that lack the excitement label, which surfaces 104 suspicious examples. It then trains machine learning models, starting with a scikit-learn bag-of-words logistic regression, to predict the label; the model's predictions disagree with the existing labels on approximately 4,000 examples (7% of the data). To refine the list, pre-trained embeddings such as byte-pair embeddings and the Universal Sentence Encoder are used to build diverse models, and discrepancies are found through confidence-based sorting and inter-model disagreement. The suspicious examples are then reviewed and relabeled in Prodigy, a process that highlights challenges such as ambiguous context, internet slang, and the mental toll of reviewing vulgar content.
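
The substring heuristic described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the original article's code: the record layout (`text` and `labels` keys), the helper name `find_suspicious`, and the toy examples are all assumptions made for the sketch.

```python
from typing import Dict, Iterable, List


def find_suspicious(examples: Iterable[Dict], stem: str = "excit",
                    label: str = "excitement") -> List[Dict]:
    """Return examples whose text mentions `stem` but that lack `label`."""
    return [
        ex for ex in examples
        if stem in ex["text"].lower() and label not in ex["labels"]
    ]


# Hypothetical toy records; the real dataset is far larger.
examples = [
    {"text": "I'm so excited for the finale!", "labels": ["joy"]},
    {"text": "This is exciting news", "labels": ["excitement"]},
    {"text": "Meh, nothing new here", "labels": ["neutral"]},
]

suspicious = find_suspicious(examples)
for ex in suspicious:
    print(ex["text"], ex["labels"])
```

A hit from this check is only a candidate for review: a text can mention excitement sarcastically or in the negative, so each match still needs a human decision.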

Key takeaway

For machine learning engineers building text classification systems, regularly auditing training data for mislabeled examples is critical. Combine heuristic checks, confidence-based sorting from simple ML models, and inter-model disagreement across diverse embeddings to prioritize suspicious examples. This systematic approach, paired with careful relabeling in a tool like Prodigy, improves data quality, which in turn supports more robust models and more trustworthy performance metrics. Documenting labeling guidelines and hard cases also eases team onboarding and keeps annotations consistent.

Key insights

Combining heuristics, ML models, and diverse embeddings effectively identifies bad labels in text datasets.

Principles

Method

Identify bad labels by combining heuristics, ML model predictions (low confidence on the labeled class, high confidence on a different class, or confidence split between classes), and inter-model disagreement using diverse embeddings. Prioritize examples that accumulate multiple "reasons to doubt."
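
One way to turn this method into a ranking is to count the reasons to doubt per example and review the highest counts first. The sketch below assumes a binary label (excitement or not) and takes each model's predicted probability of the positive class as given, for instance from a scikit-learn classifier's `predict_proba`. The function name, the model names `bow` and `bpemb`, the thresholds, and the toy rows are all hypothetical choices for illustration.

```python
def reasons_to_doubt(label, probs_by_model, low=0.2, band=0.1):
    """Collect reasons to doubt one existing label.

    label: existing annotation, 0 or 1.
    probs_by_model: dict of model name -> predicted P(label == 1).
    low, band: assumed thresholds; tune them for your own data.
    """
    reasons = []
    for name, p in probs_by_model.items():
        # Probability the model assigns to the class the annotator chose.
        p_label = p if label == 1 else 1.0 - p
        if p_label <= low:
            # In the binary case, low confidence on the labeled class
            # is the same as high confidence on the other class.
            reasons.append(f"{name}: confident in the other class")
        elif abs(p - 0.5) <= band:
            reasons.append(f"{name}: split confidence")
    # Inter-model disagreement: the models predict different classes.
    if len({p >= 0.5 for p in probs_by_model.values()}) > 1:
        reasons.append("models disagree")
    return reasons


# Hypothetical probabilities from two models built on different features.
rows = [
    {"text": "a", "label": 0, "probs": {"bow": 0.95, "bpemb": 0.45}},
    {"text": "b", "label": 1, "probs": {"bow": 0.55, "bpemb": 0.30}},
    {"text": "c", "label": 1, "probs": {"bow": 0.97, "bpemb": 0.91}},
]

scored = [(row, reasons_to_doubt(row["label"], row["probs"])) for row in rows]
# Most doubted examples first, so they reach the annotator first.
ranked = sorted(scored, key=lambda pair: len(pair[1]), reverse=True)
for row, reasons in ranked:
    print(row["text"], len(reasons), reasons)
```

Counting reasons rather than relying on a single signal keeps the review queue short: an example flagged by one borderline score ranks below one that several independent checks agree on.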

Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.