Introducing Span Categorization in Prodigy and spaCy

Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

Span Categorization (spancat) is a new spaCy component for extracting and classifying spans of text that may be long or may overlap. It differs from Named Entity Recognition (NER) in three ways: it gives explicit control over which candidate spans are considered, it returns a confidence score for each prediction, and it is less sensitive to exact span boundaries. The component has two parts: a suggester, which extracts span candidates, and a classifier, which predicts label probabilities for each candidate.

For data annotation, Prodigy provides recipes such as "spans.manual". The article illustrates the workflow with a 25,000-recipe food dataset labeled with "ingredient" and "instruction". Annotation guidelines can be configured through "prodigy.json" to keep labeling consistent. Annotation can be sped up in two ways: rule-based patterns that pre-select likely spans, and a temporary spancat model (reaching F-scores around 0.45) whose predictions are reviewed and corrected with "spans.correct". The finished annotations are then exported to spaCy's format for training.
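The suggester and classifier split can be illustrated with a small plain-Python sketch. It is not spaCy's implementation: the function names, the example sentence, and the hard-coded scores are all invented for illustration, and no model is trained or run. spaCy ships its own n-gram suggester (registered as "spacy.ngram_suggester.v1") that plays the role of `ngram_suggester` here.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_token, end_token), end is exclusive


def ngram_suggester(tokens: List[str], sizes: List[int]) -> List[Span]:
    """Propose every contiguous token window of the given sizes as a candidate."""
    candidates = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates


def keep_confident(
    scores: Dict[Span, Dict[str, float]], threshold: float = 0.5
) -> List[Tuple[Span, str, float]]:
    """Keep every (span, label, score) at or above the threshold.

    Each candidate is judged independently, so kept spans may overlap.
    """
    kept = []
    for span, label_scores in scores.items():
        for label, score in label_scores.items():
            if score >= threshold:
                kept.append((span, label, score))
    return kept


tokens = "Add two cloves of minced garlic".split()
candidates = ngram_suggester(tokens, sizes=[1, 2, 6])

# Hypothetical classifier output for a few candidates (made-up numbers,
# standing in for the label probabilities a trained classifier would give).
fake_scores = {
    (0, 6): {"INSTRUCTION": 0.88, "INGREDIENT": 0.03},
    (4, 6): {"INSTRUCTION": 0.02, "INGREDIENT": 0.62},
    (5, 6): {"INSTRUCTION": 0.01, "INGREDIENT": 0.91},
    (1, 2): {"INSTRUCTION": 0.04, "INGREDIENT": 0.10},
}

for (start, end), label, score in keep_confident(fake_scores):
    print(label, " ".join(tokens[start:end]), score)
```

In this sketch the whole sentence is kept as an instruction while "minced garlic" and "garlic" are both kept as ingredients, which shows the kind of nested, overlapping output that a one-label-per-token scheme cannot represent. The choice of window sizes is the "explicit control over candidates" described above: spans of a size that is never suggested can never be predicted.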

Key takeaway

NLP engineers and data scientists building text extraction systems where Named Entity Recognition (NER) falls short on long, overlapping, or context-sensitive spans should explore spaCy's new Span Categorization (spancat). The component gives greater control over span candidates and offers meaningful confidence scores. Use Prodigy's "spans.manual" recipe, patterns, and temporary models to annotate and refine datasets efficiently, so that the training data for complex span extraction is of high quality.

Key insights

Span Categorization (spancat) in spaCy extracts and classifies long, overlapping text spans, providing explicit control and confidence scores.

Method

Define the labels and annotation guidelines, then annotate spans with Prodigy's "spans.manual" recipe. To speed things up, pre-select likely spans with rule-based patterns, train a temporary model, and review its predictions with "spans.correct", repeating the cycle as the dataset grows.
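As a rough illustration of the rule-based pre-selection step, the sketch below builds token-based match patterns from a term list and serializes them as JSONL, one object per line with a "label" and a "pattern" key, which is the shape Prodigy's pattern files use. The helper names and the term list are hypothetical, and splitting terms on whitespace is a simplifying assumption that may not match how spaCy's tokenizer splits the same text.

```python
import json
from typing import Dict, List


def make_patterns(terms_by_label: Dict[str, List[str]]) -> List[dict]:
    """Turn each term into a case-insensitive token pattern for its label."""
    patterns = []
    for label, terms in terms_by_label.items():
        for term in terms:
            # One {"lower": ...} entry per whitespace-separated word (assumption).
            token_pattern = [{"lower": word} for word in term.lower().split()]
            patterns.append({"label": label, "pattern": token_pattern})
    return patterns


def to_jsonl(patterns: List[dict]) -> str:
    """Serialize patterns as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(pattern) for pattern in patterns)


# Hypothetical seed terms, not taken from the article's dataset.
terms = {"INGREDIENT": ["garlic", "olive oil", "Sea Salt"]}
patterns = make_patterns(terms)
jsonl_text = to_jsonl(patterns)
print(jsonl_text)
```

The printed lines could be saved to a file and supplied to an annotation recipe as a patterns file, so that matching spans arrive pre-highlighted and the annotator only confirms or corrects them.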

Best for: Machine Learning Engineer, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.