SpanCat with spaCy and Prodigy on real data

2023-06-02 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Novice, long

Summary

A project is underway to extract nested citation data from messy OCR outputs of 1980s-1990s newspapers using spaCy and Prodigy. The objective is to identify citations and segment components like publication names and years to track sources over two decades. The data, intentionally sourced from free Google Drive OCR for its imperfections, necessitates a robust model. The approach utilizes spaCy's SpanCat (span classification) because citations exhibit a hierarchical structure (parenthetical citation > individual sources > publication name/date) that standard Named Entity Recognition cannot fully address. The initial phase involves cultivating training data: loading raw OCR text, splitting it into sections, filtering for 240 examples containing an open parenthesis (including negative examples), and saving them as "focus_input.jsonl" using srsly.write_jsonl for subsequent annotation in Prodigy.

Key takeaway

NLP engineers building information extraction models from noisy, real-world text should consider spaCy's SpanCat for nested entity recognition. It handles hierarchical data, such as individual sources inside a parenthetical citation, where traditional NER falls short. Deliberately cultivate training data from messy sources, such as free OCR, and include negative examples so the model learns what is not a citation. A model trained this way is more likely to perform reliably on imperfect inputs and needs less pre-processing.

Key insights

Span classification with spaCy and Prodigy enables robust extraction of nested entities from messy OCR data, crucial for complex information retrieval.

Principles

Span classification handles nested entities.
Incorporate negative examples for robust training.
Models for real-world OCR data benefit from training on messy text.
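
To make the first principle concrete, here is a minimal sketch of how nested citation spans might be written down as character-offset dictionaries, the general shape Prodigy uses for span annotations. The example sentence and the label names (CITATION, SOURCE, PUBLICATION, YEAR) are assumptions made up for illustration; they are not taken from the original article.

```python
# Hypothetical example text and labels; not from the original article.
text = "Prices rose sharply (Daily Herald, 1987; City Times, 1991)."


def span(substring, label):
    """Build a character-offset span for the first match of substring."""
    start = text.index(substring)
    return {"start": start, "end": start + len(substring), "label": label}


# Three levels of nesting: citation > individual sources > name / year.
spans = [
    span("(Daily Herald, 1987; City Times, 1991)", "CITATION"),
    span("Daily Herald, 1987", "SOURCE"),
    span("Daily Herald", "PUBLICATION"),
    span("1987", "YEAR"),
    span("City Times, 1991", "SOURCE"),
    span("City Times", "PUBLICATION"),
    span("1991", "YEAR"),
]


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer["start"] <= inner["start"] and inner["end"] <= outer["end"]


# Every (outer, inner) pair where one span sits inside another.
nested = [
    (outer["label"], inner["label"])
    for outer in spans
    for inner in spans
    if outer is not inner and contains(outer, inner)
]

task = {"text": text, "spans": spans}
```

Because several spans cover the same characters, a flat scheme that assigns at most one label per token cannot hold all of them at once, which is the gap span classification is meant to fill.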

Method

Prepare training data by loading raw OCR text, splitting into lines, filtering for sections containing an open parenthesis, and saving the resulting dictionaries as JSONL files using srsly.write_jsonl for Prodigy annotation.
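
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the article's actual code: the function names `build_examples` and `save_jsonl`, the sample text, and the choice to treat each line as one section are all hypothetical. The cap of 240 examples and the use of `srsly.write_jsonl` come from the summary; a standard-library fallback is included in case `srsly` is not installed.

```python
import json
from pathlib import Path

try:
    import srsly  # library the article names for writing JSONL
except ImportError:  # fall back to the standard library
    srsly = None


def build_examples(raw_text, limit=240):
    """Keep up to `limit` non-empty lines that contain an open parenthesis.

    Assumption: each line of the OCR output is treated as one section.
    """
    examples = []
    for line in raw_text.splitlines():
        line = line.strip()
        if "(" in line:
            examples.append({"text": line})
        if len(examples) >= limit:
            break
    return examples


def save_jsonl(path, examples):
    """Write one JSON object per line, using srsly when available."""
    if srsly is not None:
        srsly.write_jsonl(path, examples)
    else:
        with Path(path).open("w", encoding="utf8") as f:
            for example in examples:
                f.write(json.dumps(example, ensure_ascii=False) + "\n")


# Hypothetical OCR sample. The third line has a parenthesis but no
# citation, so it passes the filter and serves as a negative example.
sample_ocr = (
    "The mayor spoke on Tuesday.\n"
    "Prices rose sharply (Daily Herald, 1987).\n"
    "Admission is free (children under 12 only).\n"
    "\n"
    "No citation here.\n"
)

examples = build_examples(sample_ocr)
```

Calling `save_jsonl("focus_input.jsonl", examples)` would then produce a file with one `{"text": ...}` object per line, ready to load into Prodigy for annotation.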

In practice

Use Google Drive for free, messy OCR.
Filter text by "(" to find potential citations.
Save data as JSONL via srsly.write_jsonl for Prodigy.

Topics

Span Classification
spaCy
Prodigy
OCR Data
Nested Entity Recognition
Training Data Cultivation

Best for: Machine Learning Engineer, NLP Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.