SpanCat with spaCy and Prodigy on real data
Summary
A project is underway to extract nested citation data from messy OCR outputs of 1980s-1990s newspapers using spaCy and Prodigy. The objective is to identify citations and segment components like publication names and years to track sources over two decades. The data, intentionally sourced from free Google Drive OCR for its imperfections, necessitates a robust model. The approach utilizes spaCy's SpanCat (span classification) because citations exhibit a hierarchical structure (parenthetical citation > individual sources > publication name/date) that standard Named Entity Recognition cannot fully address. The initial phase involves cultivating training data: loading raw OCR text, splitting it into sections, filtering for 240 examples containing an open parenthesis (including negative examples), and saving them as "focus_input.jsonl" using srsly.write_jsonl for subsequent annotation in Prodigy.
Key takeaway
NLP engineers building information extraction models from noisy, real-world text should consider spaCy's SpanCat for nested entity recognition. Span classification handles hierarchical data, such as individual sources inside a parenthetical citation, where traditional NER falls short because its entities cannot overlap. Deliberately build training data from messy sources, such as free OCR, and include negative examples so the model learns what is not a citation. A model trained this way is more likely to hold up on imperfect inputs and reduces the need for extensive pre-processing.
Key insights
Span classification with spaCy and Prodigy enables robust extraction of nested entities from messy OCR data, crucial for complex information retrieval.
Principles
- Span classification handles nested entities.
- Incorporate negative examples for robust training.
- Models meant for real-world OCR data benefit from training on messy text.
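
To make the first principle concrete, here is a minimal sketch of how nested citation spans might be represented as character offsets. The sample sentence and the label names (CITATION, SOURCE, PUBLICATION, YEAR) are invented for illustration and are not taken from the original article. In spaCy, overlapping spans like these would live in a span group under `doc.spans`, whereas `doc.ents` does not allow entities to overlap.

```python
# Illustrative sentence; not real data from the project.
text = "Prices rose sharply (Daily Herald, 1987; The Times, 1989)."


def span(substring, label):
    """Build a character-offset span dict for the first match of substring."""
    start = text.index(substring)
    return {"start": start, "end": start + len(substring), "label": label}


# Hypothetical label scheme mirroring the hierarchy:
# parenthetical citation > individual source > publication name / year.
spans = [
    span("(Daily Herald, 1987; The Times, 1989)", "CITATION"),
    span("Daily Herald, 1987", "SOURCE"),
    span("Daily Herald", "PUBLICATION"),
    span("1987", "YEAR"),
    span("The Times, 1989", "SOURCE"),
    span("The Times", "PUBLICATION"),
    span("1989", "YEAR"),
]


def contains(outer, inner):
    """True if inner lies entirely within outer (and is a different span)."""
    return (
        outer is not inner
        and outer["start"] <= inner["start"]
        and inner["end"] <= outer["end"]
    )


# Every (outer, inner) pair where one span is nested inside another.
nested_pairs = [
    (outer["label"], inner["label"])
    for outer in spans
    for inner in spans
    if contains(outer, inner)
]

for outer_label, inner_label in nested_pairs:
    print(f"{inner_label} is nested inside {outer_label}")
```

Any non-empty `nested_pairs` list means the annotation cannot be expressed as flat, non-overlapping entities, which is the case span classification is meant to cover.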
Method
Prepare training data by loading the raw OCR text, splitting it into sections, keeping the sections that contain an open parenthesis (which also captures negative examples where the parenthesis is not a citation), and saving the resulting dictionaries as a JSONL file using srsly.write_jsonl for annotation in Prodigy.
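
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the article's actual code: the sample text is invented, splitting on blank lines is an assumed section boundary, and the helper names (`split_sections`, `select_candidates`, `write_jsonl`) are hypothetical. Only `srsly.write_jsonl`, the file name "focus_input.jsonl", the "(" filter, and the 240-example cap come from the summary. The `json` fallback exists only so the sketch runs where srsly is not installed.

```python
import json
import tempfile
from pathlib import Path

try:
    import srsly  # the library named in the article
except ImportError:  # fallback so the sketch still runs without srsly
    srsly = None


def split_sections(raw_text):
    """Split raw OCR text on blank lines (an assumed section boundary)."""
    return [s.strip() for s in raw_text.split("\n\n") if s.strip()]


def select_candidates(sections, marker="(", limit=240):
    """Keep sections containing the marker, as {"text": ...} dicts."""
    return [{"text": s} for s in sections if marker in s][:limit]


def write_jsonl(path, records):
    """Write one JSON object per line, using srsly when available."""
    if srsly is not None:
        srsly.write_jsonl(path, records)
    else:
        with open(path, "w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")


# Invented sample standing in for raw OCR output, including an OCR-style typo.
raw_text = (
    "Council met on Tuesday to discuss the budget.\n\n"
    "Prices rose sharply last year (Daily Herald, 1987; The Tirnes, 1989).\n\n"
    "The mayor (who was absent) sent a statement.\n\n"
    "Weather will remain mild."
)

sections = split_sections(raw_text)
examples = select_candidates(sections)

# Written to the temp directory here; the real project path is not specified.
out_path = Path(tempfile.gettempdir()) / "focus_input.jsonl"
write_jsonl(out_path, examples)
```

In this sample the third section contains a parenthesis but no citation, so it passes the filter and serves as a negative example during annotation. Each record carries a "text" key, which is the field Prodigy reads from JSONL input.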
In practice
- Use Google Drive for free, messy OCR.
- Filter text by "(" to find potential citations.
- Save data as JSONL via srsly.write_jsonl for Prodigy.
Topics
- Span Classification
- spaCy
- Prodigy
- OCR Data
- Nested Entity Recognition
- Training Data Cultivation
Best for: Machine Learning Engineer, NLP Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).