Finding Bad Labels for Text Classification with Jupyter and Prodigy

2023-06-02 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, extended

Summary

This analysis demonstrates techniques for identifying mislabeled examples in text classification training data, using Google's GoEmotions dataset of Reddit comments. Focusing on the "excitement" label, the process begins with a heuristic: searching for the substring "excit" in texts that lack the excitement label, which surfaces 104 suspicious examples. Next, machine learning models, starting with a scikit-learn bag-of-words logistic regression, are trained to predict the label; the model's predictions disagree with the existing labels on approximately 4,000 examples (7% of the data). Further refinement uses pre-trained embeddings, such as byte-pair embeddings and the Universal Sentence Encoder, to build diverse models, then finds discrepancies through confidence-based sorting and inter-model disagreement. The suspicious examples are then reviewed and relabeled in Prodigy, a step that exposes challenges such as missing context, internet slang, and the mental toll of reading vulgar content.
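The substring heuristic described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's code: the record layout (`text` and `labels` fields), the toy examples, and the `find_suspicious` helper are all assumptions made for the example.

```python
# Toy stand-in for the dataset. The "text"/"labels" field names and the
# example records are hypothetical, not taken from the real data.
examples = [
    {"text": "I'm so excited for the finale!", "labels": ["excitement"]},
    {"text": "This is so exciting, can't wait", "labels": ["joy"]},
    {"text": "Not much excitement here tbh", "labels": ["neutral"]},
    {"text": "Thanks for the help", "labels": ["gratitude"]},
]


def find_suspicious(examples, substring="excit", label="excitement"):
    """Return examples whose text contains `substring` but lack `label`."""
    return [
        ex
        for ex in examples
        if substring in ex["text"].lower() and label not in ex["labels"]
    ]


suspicious = find_suspicious(examples)
for ex in suspicious:
    print(ex["labels"], ex["text"])
```

The third toy record shows why the output is a list of candidates for review rather than a list of errors: a text can mention excitement without expressing it.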

Key takeaway

For Machine Learning Engineers building text classification systems, regularly auditing training data for mislabeled examples is critical. Combine heuristic checks, confidence-based sorting from simple ML models, and inter-model disagreement across diverse embeddings to prioritize suspicious examples. This systematic approach, coupled with careful relabeling in a tool like Prodigy, raises data quality, which leads to more robust models and more trustworthy performance metrics. Documenting labeling guidelines and hard cases also eases team onboarding and keeps annotations consistent.

Key insights

Combining heuristics, ML models, and diverse embeddings effectively identifies bad labels in text datasets.

Principles

Trustworthy labels are crucial for reliable ML metrics.
Dataset documentation (e.g., the accompanying research paper) indicates what kinds of labeling errors to expect.
Where available, annotator disagreement data is a pragmatic place to start looking.

Method

Identify bad labels by combining heuristics, ML model predictions (low confidence on the labeled class, high confidence on a different class, or confidence split across classes), and inter-model disagreement using diverse embeddings. Prioritize examples with multiple "reasons to doubt."
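One way to make "reasons to doubt" concrete is to count how many checks each example trips and review the highest counts first. The sketch below assumes class probabilities from two models are already available as plain lists; the function name, the reason names, and the thresholds (`low`, `high`, `split`) are hypothetical values chosen for illustration.

```python
def reasons_to_doubt(label, proba_a, proba_b, low=0.2, high=0.8, split=0.6):
    """Collect reasons to doubt one label.

    label:   index of the class the example is currently labeled with
    proba_a: class probabilities from model A (list of floats)
    proba_b: class probabilities from model B (list of floats)
    low, high, split: illustrative thresholds, not tuned values
    """
    reasons = []
    # Model A gives the labeled class very little probability.
    if proba_a[label] < low:
        reasons.append("low_confidence_on_label")
    # Model A is very sure about some class other than the labeled one.
    others = [p for i, p in enumerate(proba_a) if i != label]
    if others and max(others) > high:
        reasons.append("high_confidence_on_other_class")
    # Model A is not sure about any class.
    if max(proba_a) < split:
        reasons.append("split_confidence")
    # The two models predict different classes.
    if proba_a.index(max(proba_a)) != proba_b.index(max(proba_b)):
        reasons.append("models_disagree")
    return reasons


# (label, model A probabilities, model B probabilities) for four toy examples.
rows = [
    (1, [0.05, 0.95], [0.10, 0.90]),
    (1, [0.90, 0.10], [0.85, 0.15]),
    (0, [0.55, 0.45], [0.40, 0.60]),
    (0, [0.15, 0.85], [0.70, 0.30]),
]

scored = [(i, reasons_to_doubt(*row)) for i, row in enumerate(rows)]
# Most reasons first; ties keep their original order because sorted() is stable.
ranked = sorted(scored, key=lambda item: len(item[1]), reverse=True)
for index, reasons in ranked:
    print(index, reasons)
```

Counting reasons instead of relying on a single threshold means an example flagged by several independent checks reaches the reviewer before one flagged by only a single check.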

In practice

Use substring heuristics for initial sanity checks.
Train simple ML models to find label-model disagreements.
Compare models with different embeddings to find discrepancies.
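The model-based steps above might look roughly like the following scikit-learn sketch. It makes several assumptions: the texts and labels are invented toy data, and a character n-gram model stands in for the embedding-based models (byte-pair embeddings, Universal Sentence Encoder) only so that the example has no extra dependencies. It also fits and predicts on the same data for brevity; on a real dataset, out-of-fold predictions (for example via `cross_val_predict`) may be a better choice.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = "excitement", 0 = anything else.
texts = [
    "so excited for tonight",
    "this is exciting news",
    "can't wait, super excited",
    "what an exciting game",
    "thanks for the help",
    "that was a boring match",
    "i am tired of this",
    "really excited about the trip",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # last label is deliberately doubtful

# Model 1: bag-of-words logistic regression.
word_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
# Model 2: character n-grams, a stand-in for a model built on other embeddings.
char_model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
word_model.fit(texts, labels)
char_model.fit(texts, labels)

word_pred = word_model.predict(texts)
char_pred = char_model.predict(texts)

# Examples where the first model disagrees with the existing label.
label_disagree = np.flatnonzero(word_pred != labels)
# Examples where the two models disagree with each other.
model_disagree = np.flatnonzero(word_pred != char_pred)

# Confidence-based sorting: lowest probability on the labeled class first.
# Column order follows classes_ ([0, 1] here), so the label indexes the column.
proba = word_model.predict_proba(texts)
conf_in_label = proba[np.arange(len(labels)), labels]
by_doubt = np.argsort(conf_in_label)

for i in by_doubt[:3]:
    print(round(float(conf_in_label[i]), 3), labels[i], texts[i])
```

The indices in `label_disagree`, `model_disagree`, and the head of `by_doubt` are candidates to send to an annotation tool for review; none of them proves that a label is wrong.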

Topics

Text Classification
Data Labeling Quality
Mislabeled Data Detection
Machine Learning Models
Word Embeddings
Prodigy Annotation Tool

Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.