Group-by statements that save the day

2022-12-15 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Data Science & Analytics, Artificial Intelligence & Machine Learning · Depth: Novice, extended

Summary

This analysis highlights how basic data exploration, particularly SQL "GROUP BY" statements, can uncover insights that are easy to miss when rushing to apply advanced machine learning. Using the "Chick Weight" dataset, it shows that grouping by individual chicken reveals premature deaths, a "dead chickens" problem that predictive models might ignore. Similarly, an examination of the "Google Emotions" dataset, which comprises nearly 58,000 text examples and over 200,000 annotations by 82 people, shows that only 13% of examples had full annotator agreement across 27 emotion tags. This level of disagreement, discoverable through simple grouping, calls into question the value of training complex models before checking data quality. The discussion advocates critical thinking and understanding data provenance, suggests that basic statistical checks can prevent deploying flawed models, and introduces the "Doubt Lab Library" for identifying bad labels.

Key takeaway

If you are a data scientist or machine learning engineer evaluating a new dataset, prioritize basic exploratory analysis, especially "GROUP BY" statements, before deploying complex models. Investigate data provenance and annotator agreement to uncover hidden issues, such as incomplete data or inconsistent labels, that can compromise a model. This step helps ensure your predictions rest on sound data and keeps you from deploying models that silently misread real-world phenomena.

Key insights

Simple data exploration, such as a GROUP BY, often reveals critical issues that complex ML models miss, which is why human judgment still matters.

Principles

Visualizations surprise; machine learning scales.
Understand data provenance before modeling.
Basic statistical checks prevent flawed predictions.

Method

Before modeling, run a GROUP BY on key identifiers (e.g., individual subjects, text examples) to check for unexpected patterns or annotator disagreement, and find out how the data was created.
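
As a minimal sketch of this method, the snippet below groups a small table of weight measurements by subject and flags any subject whose measurements stop before the end of the study. The table name, column names, and rows are invented for illustration and are not the real "Chick Weight" data; Python's built-in sqlite3 module is used only so the SQL can run without a database server.

```python
import sqlite3

# Toy rows (chick_id, day, weight_g), invented for illustration only.
rows = [
    (1, 0, 42), (1, 2, 51), (1, 4, 59), (1, 6, 64),
    (2, 0, 40), (2, 2, 49), (2, 4, 58), (2, 6, 72),
    (3, 0, 41), (3, 2, 45),  # chick 3 stops being measured early
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chick_weight (chick_id INTEGER, day INTEGER, weight_g REAL)"
)
conn.executemany("INSERT INTO chick_weight VALUES (?, ?, ?)", rows)

# Group by subject, then keep only subjects whose last measurement
# comes before the last day observed anywhere in the table.
query = """
SELECT chick_id, COUNT(*) AS n_obs, MAX(day) AS last_day
FROM chick_weight
GROUP BY chick_id
HAVING MAX(day) < (SELECT MAX(day) FROM chick_weight)
ORDER BY chick_id
"""
incomplete = conn.execute(query).fetchall()
print(incomplete)  # [(3, 2, 2)]
```

A subject that shows up in this result is not necessarily dead; it is a prompt to go back to whoever collected the data and ask why the measurements stop.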

In practice

Use GROUP BY to identify "dead chickens" in time-series data.
Check annotator agreement on multi-labeled datasets.
Use tools like the Doubt Lab Library to check label quality.
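
The annotator-agreement check above can be sketched with the same grouping idea: group annotations by example and count how many distinct labels each example received. The example IDs, annotator names, and labels below are hypothetical, and the sketch assumes each annotator assigns exactly one label per example, which is simpler than a real multi-label emotion dataset.

```python
import sqlite3

# Hypothetical annotations: (example_id, annotator, label).
annotations = [
    ("ex1", "ann_a", "joy"), ("ex1", "ann_b", "joy"), ("ex1", "ann_c", "joy"),
    ("ex2", "ann_a", "anger"), ("ex2", "ann_b", "annoyance"), ("ex2", "ann_c", "anger"),
    ("ex3", "ann_a", "neutral"), ("ex3", "ann_b", "neutral"),
    ("ex4", "ann_a", "fear"), ("ex4", "ann_c", "surprise"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations (example_id TEXT, annotator TEXT, label TEXT)"
)
conn.executemany("INSERT INTO annotations VALUES (?, ?, ?)", annotations)

# One row per example: how many annotators saw it, how many distinct labels it got.
per_example = conn.execute(
    """
    SELECT example_id,
           COUNT(DISTINCT annotator) AS n_annotators,
           COUNT(DISTINCT label) AS n_labels
    FROM annotations
    GROUP BY example_id
    ORDER BY example_id
    """
).fetchall()

# Full agreement here means every annotator chose the same single label.
full_agreement = [ex for ex, _, n_labels in per_example if n_labels == 1]
agreement_rate = len(full_agreement) / len(per_example)

print(per_example)
print(f"full agreement on {agreement_rate:.0%} of examples")
```

On a real dataset, a low agreement rate from a query like this is a reason to inspect the disputed examples by hand before training anything on the labels.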

Topics

Data Exploration
SQL Group By
Data Quality
Annotator Disagreement
Machine Learning Pipelines
Doubt Lab Library

Best for: AI Engineer, NLP Engineer, AI Scientist, Data Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.