Introducing Span Categorization in Prodigy and spaCy

2022-06-22 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

Span Categorization (spancat) is a new spaCy component for extracting and classifying text spans, including long and overlapping ones. It differs from Named Entity Recognition (NER) in three ways: it gives explicit control over which candidate spans are considered, it provides a confidence score for each prediction, and it is less sensitive to the exact edges of a span. The component has two parts: a suggester, which extracts span candidates, and a classifier, which predicts label probabilities for each candidate. For annotation, Prodigy provides recipes such as "spans.manual"; the article demonstrates this on a dataset of 25,000 food recipes labeled with "ingredient" and "instruction". Annotation guidelines configured through "prodigy.json" help keep labels consistent. Labeling can be sped up by pre-selecting spans with rule-based patterns and by training temporary spancat models (with F-scores around 0.45) whose predictions are then corrected with "spans.correct". The finished annotations are exported to spaCy's format.
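
The split between suggester and classifier can be illustrated with a small, library-free sketch. This is not spaCy's implementation: `ngram_suggester` and `keep_confident` are hypothetical helper names, and the scores are hand-written stand-ins for what a trained classifier would predict.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def ngram_suggester(tokens: List[str], sizes: List[int]) -> List[Span]:
    """Propose every contiguous token window of the given sizes as a candidate."""
    candidates = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates


def keep_confident(scores: Dict[Span, float], threshold: float) -> List[Span]:
    """Keep each candidate whose score clears the threshold.

    Candidates are judged independently, so overlapping spans can both survive.
    """
    return sorted(span for span, score in scores.items() if score >= threshold)


tokens = ["two", "cloves", "of", "garlic"]
candidates = ngram_suggester(tokens, sizes=[1, 4])

# Hand-written scores standing in for a trained classifier's output (assumption).
toy_scores = {span: 0.1 for span in candidates}
toy_scores[(3, 4)] = 0.9  # "garlic"
toy_scores[(0, 4)] = 0.8  # "two cloves of garlic"

kept = keep_confident(toy_scores, threshold=0.5)
print([" ".join(tokens[s:e]) for s, e in kept])
# → ['two cloves of garlic', 'garlic']
```

Both the full phrase and the single token inside it are kept, which is the overlapping behavior that distinguishes this setup from NER. The choice of window sizes is where the explicit control over candidates comes from: a span that is never suggested can never be predicted.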

Key takeaway

If you are an NLP engineer or data scientist building a text extraction system where Named Entity Recognition (NER) falls short on long, overlapping, or context-sensitive spans, consider spaCy's Span Categorization (spancat) component. It gives you more control over span candidates and returns confidence scores for its predictions. Use Prodigy's "spans.manual" recipe, rule-based patterns, and temporary models to annotate and refine your dataset efficiently, so that the training data for complex span extraction stays consistent.

Key insights

Span Categorization (spancat) in spaCy extracts and classifies long, overlapping text spans, with explicit control over candidate spans and a confidence score for each prediction.

Principles

Spancat is a better fit than NER when spans are long, overlapping, or context-dependent.
Annotation guidelines are crucial for dataset consistency across annotators.
Iterative training of temporary models significantly accelerates data labeling.

Method

Define labels and annotation guidelines, annotate spans with Prodigy's "spans.manual" recipe, then speed up labeling with rule-based patterns and with temporary models whose predictions are corrected via "spans.correct", repeating the cycle as the data grows.
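
Rule-based pre-selection depends on a patterns file. As a minimal sketch, the snippet below turns a short term list into JSONL entries, each with a label and a token-level pattern in the style of spaCy's Matcher. The terms and the `build_patterns` helper are illustrative assumptions, not taken from the original article.

```python
import json

# Illustrative terms only; a real project would draw these from its own data.
terms = {"INGREDIENT": ["garlic", "olive oil", "brown sugar"]}


def build_patterns(terms_by_label):
    """Turn plain phrases into case-insensitive, token-level match patterns."""
    patterns = []
    for label, phrases in terms_by_label.items():
        for phrase in phrases:
            token_pattern = [{"lower": word} for word in phrase.lower().split()]
            patterns.append({"label": label, "pattern": token_pattern})
    return patterns


patterns = build_patterns(terms)

# One JSON object per line, ready to be saved as a .jsonl file.
jsonl = "\n".join(json.dumps(entry) for entry in patterns)
print(jsonl)
```

Saved to a file, output of this shape is the kind of input that the recipe's patterns option is expected to accept. Splitting phrases on whitespace is a simplification and may not match the tokenizer exactly for punctuation or hyphenated terms.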

In practice

Implement custom suggester functions to control span candidate generation.
Embed HTML guidelines in "prodigy.json" for consistent annotation rules.
Utilize temporary spancat models to pre-label data for faster correction cycles.
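
One way to wire up the annotation guidelines is sketched below. It assumes Prodigy's "instructions" setting, which is expected to point at a file of annotator instructions that may contain HTML; the file name "guidelines.html" is hypothetical, and the setting should be checked against the Prodigy documentation for your version.

```python
import json

# Hypothetical path to an HTML file describing when to use each label.
guidelines_path = "guidelines.html"

# Only the setting relevant to guidelines is shown; a real prodigy.json
# usually carries other settings as well.
config = {"instructions": guidelines_path}

config_text = json.dumps(config, indent=2)
print(config_text)
```

Keeping the guidelines in a separate file means every annotator sees the same rules in the interface, and the rules can be revised without touching the recipe command.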

Topics

Span Categorization
spaCy
Prodigy
Text Annotation
Named Entity Recognition
NLP Components

Best for: Machine Learning Engineer, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.