Finding Bad Labels for Text Classification with Jupyter and Prodigy
Summary
This analysis demonstrates techniques for identifying mislabeled examples in text classification training data, using Google's GoEmotions dataset of Reddit comments. Focusing on the "excitement" label, the process begins with a heuristic: searching for the substring "excit" in texts that lack the excitement label, which surfaced 104 suspicious examples. Next, machine learning models, starting with a scikit-learn bag-of-words logistic regression, were trained to predict the label; model predictions and existing labels disagreed on approximately 4,000 examples (7% of the data). Further refinement used pre-trained embeddings, such as byte-pair embeddings and the Universal Sentence Encoder, to build diverse models, then ranked candidates by confidence-based sorting and by disagreement between models. The suspicious examples were then reviewed and relabeled in Prodigy, which highlighted challenges such as missing context, internet slang, and the mental toll of reading vulgar content.
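The substring heuristic described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's code: the function name `find_substring_suspects`, the dictionary schema, and the toy examples are all assumptions made for the sketch.

```python
def find_substring_suspects(examples, substring, label):
    """Return examples whose text contains `substring` but that lack `label`."""
    needle = substring.lower()
    return [
        ex for ex in examples
        if needle in ex["text"].lower() and label not in ex["labels"]
    ]


# Hypothetical toy data; the real dataset has its own schema.
examples = [
    {"text": "I'm so excited for the finale!", "labels": ["joy"]},
    {"text": "This is exciting news", "labels": ["excitement"]},
    {"text": "Meh, nothing to see here", "labels": ["neutral"]},
    {"text": "Not EXCITED at all, honestly", "labels": ["disappointment"]},
]

suspects = find_substring_suspects(examples, "excit", "excitement")
for ex in suspects:
    print(ex["labels"], ex["text"])
```

The last toy example shows why the output is only a list of candidates: a text can mention excitement without expressing it, so each hit still needs human review.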
Key takeaway
For machine learning engineers building text classification systems, regularly auditing training data for mislabels is critical. Combine heuristic checks, confidence-based sorting from simple ML models, and inter-model disagreement using diverse embeddings to prioritize suspicious examples. This systematic approach, coupled with careful relabeling in tools like Prodigy, raises data quality, which in turn leads to more robust models and more trustworthy performance metrics. Documenting labeling guidelines and hard cases also streamlines team onboarding and helps keep annotations consistent.
Key insights
Combining heuristics, ML models, and diverse embeddings effectively identifies bad labels in text datasets.
Principles
- Trustworthy labels are crucial for reliable ML metrics.
- Dataset documentation (e.g., the accompanying research paper) tells you what kinds of label errors to expect.
- Annotator disagreement data is a pragmatic starting point.
Method
Identify bad labels by combining heuristics, ML model predictions (low confidence on correct class, high confidence on wrong class, or split confidence), and inter-model disagreement using diverse embeddings. Prioritize examples with multiple "reasons to doubt."
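One way to combine these signals is to count the "reasons to doubt" per example and review the highest counts first. The sketch below assumes each model outputs a class-probability dictionary; the function name `reasons_to_doubt`, the thresholds (`low`, `high`, `margin`), the model names, and the toy rows are illustrative assumptions, not values from the article.

```python
def reasons_to_doubt(label, model_probs, low=0.3, high=0.7, margin=0.1):
    """Collect reasons to doubt `label`, given per-model class probabilities.

    model_probs maps a model name to a dict of class -> probability.
    The thresholds are illustrative assumptions and should be tuned.
    """
    reasons = set()
    predictions = set()
    for probs in model_probs.values():
        ranked = sorted(probs, key=probs.get, reverse=True)
        predictions.add(ranked[0])
        # Low confidence on the class the annotator assigned.
        if probs[label] < low:
            reasons.add("low_confidence_on_label")
        # High confidence on some class other than the assigned one.
        if any(p > high for cls, p in probs.items() if cls != label):
            reasons.add("high_confidence_on_other_class")
        # The top two classes are nearly tied.
        if len(ranked) > 1 and probs[ranked[0]] - probs[ranked[1]] < margin:
            reasons.add("split_confidence")
    # The models do not agree on the most likely class.
    if len(predictions) > 1:
        reasons.add("models_disagree")
    return sorted(reasons)


# Hypothetical rows: (example id, assigned label, per-model probabilities).
rows = [
    ("clean", "joy", {"bow": {"joy": 0.9, "excitement": 0.1},
                      "emb": {"joy": 0.85, "excitement": 0.15}}),
    ("suspicious", "joy", {"bow": {"joy": 0.1, "excitement": 0.9},
                           "emb": {"joy": 0.6, "excitement": 0.4}}),
    ("borderline", "joy", {"bow": {"joy": 0.52, "excitement": 0.48},
                           "emb": {"joy": 0.6, "excitement": 0.4}}),
]

# Review the examples with the most reasons to doubt first.
ranked = sorted(rows, key=lambda r: len(reasons_to_doubt(r[1], r[2])), reverse=True)
for name, label, probs in ranked:
    print(name, reasons_to_doubt(label, probs))
```

Counting reasons rather than relying on a single signal keeps the review queue short: an example flagged by several independent checks is more likely to be a genuine mislabel than one flagged by a single borderline score.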
In practice
- Use substring heuristics for initial sanity checks.
- Train simple ML models to find label-model disagreements.
- Compare models with different embeddings to find discrepancies.
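The second step in the list above, finding label-model disagreements with a simple model, might look like the following sketch using scikit-learn's bag-of-words features and logistic regression. The toy texts and labels are invented for illustration, and scoring each example with out-of-fold predictions via `cross_val_predict` is a choice made for this sketch; the source does not say how the article's model was evaluated.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = excitement, 0 = not excitement.
texts = [
    "so excited for this", "this is exciting", "can't wait, so excited",
    "exciting news today", "I am excited about the game", "what an exciting match",
    "this is boring", "I hate mondays", "nothing happened today",
    "the game was dull", "boring news again", "what a dull match",
    "so excited about the exciting game",  # deliberately labeled 0 below
]
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# Bag-of-words features feeding a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Out-of-fold probabilities, so no example is scored by a model that saw its label.
proba = cross_val_predict(model, texts, labels, cv=3, method="predict_proba")
preds = proba.argmax(axis=1)      # column index equals the class (0 or 1) here
confidence = proba.max(axis=1)    # probability of the predicted class

# Keep examples where the model disagrees with the label, most confident first.
disagree = np.flatnonzero(preds != labels)
suspects = disagree[np.argsort(-confidence[disagree])]
for i in suspects:
    print(f"label={labels[i]} pred={preds[i]} conf={confidence[i]:.2f} text={texts[i]!r}")
```

On a dataset this small the flagged examples will be noisy; the point is the workflow of sorting disagreements by confidence so that the most suspicious labels reach the annotation tool first. The same loop can be repeated with a second model built on different embeddings to obtain the inter-model comparison.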
Topics
- Text Classification
- Data Labeling Quality
- Mislabeled Data Detection
- Machine Learning Models
- Word Embeddings
- Prodigy Annotation Tool
Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).