30% of Google's Emotions Dataset is Mislabeled
Summary
Google's "GoEmotions" dataset, a human-labeled collection of 58,000 Reddit comments categorized into 27 emotions, has an estimated mislabeling rate of about 30%. An independent audit of 1,000 randomly sampled comments found 308 strong errors (30.8%), which undermines the reliability of models trained on this data. The mislabels stem from labelers misreading profanity, English idioms, sarcasm, basic English, US politics and culture, and Reddit memes. For example, "LETS FUCKING GOOOOO" was labeled ANGER, and the sarcastic "Yay, cold McDonald's. My favorite." was labeled LOVE. These errors are attributed to Google's data labeling methodology: labelers were shown each comment with no additional metadata, and all of them were native English speakers from India, who lacked familiarity with US-centric online culture and slang.
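The headline figure is an estimate from a sample, so it carries some uncertainty. The sketch below recomputes the rate from the audit counts quoted above and adds a 95% Wilson score interval. The interval is an illustration added here, not part of the original audit, and it assumes the 1,000 comments were a simple random sample of the dataset.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


AUDITED = 1_000        # comments checked in the audit (from the summary)
STRONG_ERRORS = 308    # comments judged strongly mislabeled (from the summary)
DATASET_SIZE = 58_000  # approximate dataset size (from the summary)

rate = STRONG_ERRORS / AUDITED
low, high = wilson_interval(STRONG_ERRORS, AUDITED)

print(f"observed error rate: {rate:.1%}")
print(f"95% interval: {low:.1%} to {high:.1%}")
# Extrapolation to the full dataset holds only if the sample is representative.
print(f"implied mislabeled comments: {low * DATASET_SIZE:,.0f} to {high * DATASET_SIZE:,.0f}")
```

Even at the lower end of the interval, more than a quarter of the dataset would be affected under these assumptions.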
Key takeaway
For AI engineers building NLP models, the choice of training data directly affects model accuracy and real-world applicability. Critically evaluate public datasets for labeling quality, and consider investing in robust, context-aware labeling infrastructure with culturally competent annotators. Otherwise, significant labeling errors propagate into your models and can cause misclassifications in applications such as toxicity detection or spam filtering.
Key insights
A poor data labeling methodology, especially one that lacks context and cultural fluency, severely compromises dataset quality and model performance.
Principles
- Context is crucial for accurate text annotation.
- Labeler cultural fluency impacts data quality.
- High-quality data is paramount for robust ML models.
Method
Effective data labeling requires providing comprehensive context (e.g., subreddit, parent post), using culturally fluent labelers, and implementing quality-control measures such as dynamic exams and AI-human discrepancy checks.
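One way to picture an AI-human discrepancy check is a filter that sends a comment back for human review when a model confidently disagrees with the human label. This is a minimal sketch, not the method used by the original article's authors. The `LabeledComment` type, the `flag_discrepancies` helper, the 0.8 threshold, the model predictions and scores, and the third comment are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class LabeledComment:
    text: str
    human_label: str
    model_label: str         # hypothetical model prediction
    model_confidence: float  # hypothetical score in [0, 1]


def flag_discrepancies(items, min_confidence=0.8):
    """Return items where a confident model disagrees with the human label."""
    return [
        item
        for item in items
        if item.model_label != item.human_label
        and item.model_confidence >= min_confidence
    ]


comments = [
    # Texts and human labels of the first two rows come from the summary;
    # model outputs are made up for this example.
    LabeledComment("LETS FUCKING GOOOOO", "ANGER", "EXCITEMENT", 0.93),
    LabeledComment("Yay, cold McDonald's. My favorite.", "LOVE", "ANNOYANCE", 0.55),
    # Invented comment where human and model agree.
    LabeledComment("Thanks so much, this helped!", "GRATITUDE", "GRATITUDE", 0.97),
]

flagged = flag_discrepancies(comments)
for item in flagged:
    print(f"review: {item.text!r} human={item.human_label} model={item.model_label}")
```

The threshold is a trade-off: a lower value catches more suspect labels, such as the sarcastic McDonald's comment here, at the cost of more review work.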
In practice
- Provide full context to human labelers.
- Match labeler demographics to data source.
- Implement rigorous labeler testing.
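The first and last points above can be sketched in a few lines: a labeling task that carries its context, and a qualification check that scores a labeler against a small gold set. Every name here (`LabelingTask`, `render_task`, `passes_exam`), the 0.9 pass mark, and the sample subreddit, parent post, and gold answers are assumptions made for illustration, not details from the original article.

```python
from dataclasses import dataclass


@dataclass
class LabelingTask:
    comment: str
    subreddit: str    # context shown to the labeler alongside the comment
    parent_post: str  # the post or comment being replied to


def render_task(task: LabelingTask) -> str:
    """Format the task so the labeler sees the context before the comment."""
    return (
        f"Subreddit: r/{task.subreddit}\n"
        f"Parent post: {task.parent_post}\n"
        f"Comment: {task.comment}"
    )


def passes_exam(answers: dict, gold: dict, min_accuracy: float = 0.9) -> bool:
    """Check a labeler's answers against gold labels; unanswered items count as wrong."""
    correct = sum(answers.get(qid) == label for qid, label in gold.items())
    return correct / len(gold) >= min_accuracy


# Invented context for a comment quoted in the summary.
task = LabelingTask(
    comment="LETS FUCKING GOOOOO",
    subreddit="nba",
    parent_post="Game thread: final score",
)
print(render_task(task))

# Invented gold set and labeler answers.
gold = {"q1": "EXCITEMENT", "q2": "ANNOYANCE", "q3": "GRATITUDE", "q4": "ANGER"}
labeler_a = dict(gold)                  # 4 of 4 correct
labeler_b = {**gold, "q1": "ANGER"}     # 3 of 4 correct

print(passes_exam(labeler_a, gold), passes_exam(labeler_b, gold))
```

A dynamic exam would rotate the gold questions over time so that labelers cannot memorize them; the fixed dictionary here stands in for that pool.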
Topics
- Data Quality
- Emotion Recognition
- Dataset Mislabels
- Natural Language Processing
- Data Labeling
Best for: AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Data Scientist, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.