Efficient Information Extraction From Text With spaCy

2023-05-11 · Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, extended

Summary

Victoria Slocum, a Developer Advocate at Explosion, presented on efficient information extraction using spaCy and Prodigy, focusing on named entity recognition (NER). The presentation highlighted challenges in NLP, particularly inconsistent data annotation, and demonstrated how Explosion's tools address them. On the MIT restaurant reviews dataset, an initial NER model achieved 76.52% accuracy. After annotation inconsistencies were identified, a Prodigy workflow was used to re-annotate approximately 4,000 examples. The re-trained NER model alone reached 86.68% accuracy. Adding a span ruler component on top of the NER model had raised accuracy by about 1% before re-annotation; after re-annotation, that gain grew to almost 2%. The discussion also touched on Explosion's new spaCy-LLM repository, which aims to integrate large language models for faster prototyping while keeping output structured, reliable, and possible to evaluate in production.

Key takeaway

For NLP engineers building production-ready systems, prioritize data quality and consistent annotation. If your models underperform, investigate annotation inconsistencies in your training data using tools like Prodigy. Re-annotating even a portion of your dataset can significantly boost model accuracy: in this case, NER improved from 76.52% to 86.68%. Cleaner annotations also make rule-based components more effective, leading to more robust and reliable information extraction.

Key insights

Consistent data annotation significantly improves NLP model accuracy and rule-based system effectiveness.

Principles

ML requires data, knowledge, and iteration.
Input data understanding drives output quality.
LLMs need human oversight and structured frameworks.

Method

Train a baseline NER model, identify annotation inconsistencies, then use Prodigy to re-annotate data by comparing original labels with model predictions, and finally re-train for improved accuracy.
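The comparison step in this method can be illustrated with a minimal sketch. It is not Prodigy's own recipe: the `Span` type, the `find_disagreements` helper, and the sample labels are all hypothetical, and the sketch assumes each example's original annotations and model predictions are available as lists of character-offset spans. It flags the examples where the two disagree, which are the candidates worth sending to a re-annotation pass.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A labelled character span (hypothetical structure for this sketch)."""
    start: int
    end: int
    label: str


def find_disagreements(
    gold: List[List[Span]], predicted: List[List[Span]]
) -> List[int]:
    """Return indices of examples whose original and predicted spans differ."""
    flagged = []
    for i, (gold_spans, pred_spans) in enumerate(zip(gold, predicted)):
        # Compare as sets so that span order does not matter.
        if set(gold_spans) != set(pred_spans):
            flagged.append(i)
    return flagged


# Made-up data: original annotations vs. baseline model predictions.
gold = [
    [Span(0, 9, "Rating")],   # agrees with the model
    [Span(2, 7, "Cuisine")],  # same span, different label
    [],                       # annotator marked nothing
]
predicted = [
    [Span(0, 9, "Rating")],
    [Span(2, 7, "Dish")],
    [Span(4, 10, "Price")],
]

to_review = find_disagreements(gold, predicted)
print(to_review)  # indices of examples to queue for re-annotation
```

A disagreement does not say which side is wrong. It only narrows down where a human should look, which is why the re-annotation itself still happens in an annotation tool.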

In practice

Use spaCy projects for reproducible NLP.
Employ Prodigy for consistent data annotation.
Combine SpanRuler with NER to refine entity predictions with rules.
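As a rough illustration of the SpanRuler point, the sketch below adds spaCy's built-in `span_ruler` component to a blank English pipeline and has it write matches to `doc.ents`. The pattern, the `Rating` label, and the sample sentence are assumptions made for this example and are not taken from the talk. In a real pipeline a trained `ner` component would come before the ruler; it is omitted here so the snippet runs without a trained model. The sketch assumes a spaCy version that ships `span_ruler` (3.3.1 or later).

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# annotate_ents=True writes matches to doc.ents; overwrite=False keeps any
# entities set by earlier components (e.g. a trained NER placed before it).
ruler = nlp.add_pipe(
    "span_ruler",
    config={"annotate_ents": True, "overwrite": False, "spans_key": None},
)

# Hypothetical token pattern for a rating phrase.
ruler.add_patterns(
    [{"label": "Rating", "pattern": [{"LOWER": "five"}, {"LOWER": "star"}]}]
)

doc = nlp("a five star steakhouse")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

Rules like this are only as useful as the labelling scheme they encode, which is consistent with the talk's observation that the ruler's contribution grew once the annotations were made consistent.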

Topics

spaCy
Prodigy
Named Entity Recognition
Data Annotation
NLP Pipelines
Large Language Models

Best for: Machine Learning Engineer, NLP Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.