Bulk Labelling and Prodigy

2022-07-05 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, extended

Summary

Bulk labeling is a technique for accelerating data annotation in text classification tasks. It uses a pre-trained language model, such as 'all-MiniLM-L6-v2' loaded through the sentence-transformers library, to turn each text into a high-dimensional embedding. UMAP then reduces these embeddings to two dimensions so they can be drawn as a scatter plot. An interactive tool, "bulk", lets users select clusters of semantically similar texts on that plot and assign a label to the whole selection at once. This bootstraps annotation by surfacing interesting subsets that simple string matching would miss (string matching can also bias the selection toward the chosen keywords), and it prepares the data for refinement in annotation tools like Prodigy.

Key takeaway

For Machine Learning Engineers or Data Scientists starting text classification projects with large unlabeled datasets, consider integrating bulk labeling into your workflow. Using tools like "bulk" and Prodigy, you can quickly identify and pre-label relevant text clusters based on semantic similarity, which speeds up the initial data annotation phase. The selections are not perfect, but they provide useful subsets for focused manual review, reducing the time spent on irrelevant examples and mitigating the bias introduced by simple keyword matching.

Key insights

Bulk labeling uses language model embeddings and dimensionality reduction to visually cluster and rapidly pre-label text subsets for annotation.

Principles

Text embeddings capture contextual similarity.
Visualizing 2D embeddings reveals natural clusters.
Bulk pre-labeling speeds up initial annotation.
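
To make the first principle concrete, here is a minimal sketch of how similarity between embeddings is commonly measured. Cosine similarity is an assumption of this example (the summary does not name a metric), and the four-dimensional vectors are made up for illustration; real embeddings from a model such as 'all-MiniLM-L6-v2' have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real sentence embeddings.
toy_embeddings = {
    "refund my order": [0.9, 0.1, 0.0, 0.2],
    "I want my money back": [0.8, 0.2, 0.1, 0.3],
    "great weather today": [0.0, 0.9, 0.8, 0.1],
}

sim_related = cosine_similarity(
    toy_embeddings["refund my order"], toy_embeddings["I want my money back"]
)
sim_unrelated = cosine_similarity(
    toy_embeddings["refund my order"], toy_embeddings["great weather today"]
)

# The two refund requests share no keywords, yet their vectors point in a
# similar direction, so they score higher than the unrelated pair.
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Texts that score high against each other tend to land close together after dimensionality reduction, which is what makes clusters visible on the scatter plot.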

Method

Load the texts, generate high-dimensional embeddings (e.g., with the sentence-transformers model 'all-MiniLM-L6-v2'), reduce them to 2D (e.g., with UMAP), then use an interactive tool (like "bulk") to visually select and label text clusters. The labeled subsets then go to an annotation tool such as Prodigy for detailed review.
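
The steps above can be sketched roughly as follows. This is a sketch under stated assumptions, not the original article's code: the helper names `embed_and_project` and `write_bulk_csv` are hypothetical, and the `text`, `x`, `y` column layout is an assumption about what the interactive tool reads, so check the tool's documentation before relying on it. The heavy imports sit inside the function so the CSV helper can be used without them.

```python
import csv


def embed_and_project(texts, model_name="all-MiniLM-L6-v2"):
    """Embed texts and reduce them to 2D coordinates.

    Requires the `sentence-transformers` and `umap-learn` packages.
    """
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts)  # one high-dimensional vector per text
    # UMAP works best with many more texts than its neighbourhood size.
    return UMAP(n_components=2).fit_transform(embeddings)


def write_bulk_csv(texts, coords, path):
    """Write one row per text with its 2D position (assumed column names)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "x", "y"])
        writer.writeheader()
        for text, (x, y) in zip(texts, coords):
            writer.writerow({"text": text, "x": float(x), "y": float(y)})


# Usage (downloads the model on first run; the file name is hypothetical):
#   texts = [...]  # a few hundred or more unlabeled texts
#   write_bulk_csv(texts, embed_and_project(texts), "ready.csv")
```

Keeping the projection and the export separate means the expensive embedding step runs once, and the resulting file can be reopened in the interactive tool as often as needed.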

In practice

Install the interactive tool with `pip install bulk`.
Generate text embeddings with `SentenceTransformer` (from the `sentence-transformers` package).
Reduce the embeddings to 2D with `UMAP` (from the `umap-learn` package).
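
What the interactive selection step does can be approximated in a few lines. This is only a stand-in for illustration, not the "bulk" API: the tool lets you draw a selection on the plot, while this sketch uses a plain rectangle. The function names, coordinates, the `REFUND` label, and the `text`/`label` record layout are all hypothetical.

```python
def select_in_box(rows, x_min, x_max, y_min, y_max):
    """Return the rows whose 2D position falls inside the rectangle."""
    return [
        r for r in rows
        if x_min <= r["x"] <= x_max and y_min <= r["y"] <= y_max
    ]


def pre_label(rows, label):
    """Attach one candidate label to every selected text."""
    return [{"text": r["text"], "label": label} for r in rows]


# Made-up 2D positions, as if produced by a dimensionality reduction step.
rows = [
    {"text": "refund my order", "x": 1.0, "y": 1.2},
    {"text": "I want my money back", "x": 1.1, "y": 0.9},
    {"text": "great weather today", "x": -3.0, "y": 4.0},
]

selected = select_in_box(rows, x_min=0.5, x_max=1.5, y_min=0.5, y_max=1.5)
candidates = pre_label(selected, "REFUND")

for record in candidates:
    print(record)
```

The resulting records are candidates rather than final annotations: the point of the workflow is to hand a promising subset to a tool like Prodigy, where each example is still reviewed by a person.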

Topics

Text Classification
Data Annotation
Language Models
Text Embeddings
UMAP
Prodigy
Bulk Labeling

Best for: Machine Learning Engineer, Data Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.