Training a custom entity linking model with spaCy
Summary
spaCy's recently implemented Entity Linking functionality resolves ambiguous textual mentions to unique concepts in a knowledge base. This tutorial demonstrates training a custom entity linking model from scratch with spaCy. It covers setting up a simple knowledge base with 3 entities and 300-dimensional entity vectors, annotating training data in Prodigy from 30 Wikipedia sentences, and training a new "entity_linker" component for 500 iterations. The pipeline has three stages: Named Entity Recognition, candidate generation from the knowledge base, and final disambiguation. On a small, unseen test set of 6 sentences containing "Emerson" mentions, the trained model reached approximately 83% accuracy.
Key takeaway
For NLP engineers building custom information extraction systems, spaCy's Entity Linking is the step that disambiguates entity mentions. Prioritize building a representative knowledge base and generating high-quality, domain-specific training data. Linking ambiguous mentions to unique identifiers in this way supports downstream tasks such as relation extraction and graph construction. Consider using a tool like Prodigy to streamline the annotation workflow.
Key insights
Entity Linking resolves ambiguous text mentions to unique knowledge base concepts by leveraging context.
Principles
- Entity Linking pipelines require NER, candidate generation, and disambiguation.
- Knowledge base size must balance recall with practical manageability.
- Annotator feedback is crucial for understanding data complexity.
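The three stages named in the first principle can be sketched in plain Python. This is a toy illustration, not spaCy's implementation: the entity IDs (E1 to E3), descriptions, prior probabilities, and all function names are invented for the example, the "NER" step is a simple alias lookup, and disambiguation uses word overlap with a description rather than a trained model.

```python
import re

# Hypothetical toy knowledge base: IDs, descriptions, and priors are invented.
ENTITIES = {
    "E1": "tennis player who won championship titles",
    "E2": "essayist and poet who wrote about nature",
    "E3": "keyboard musician who played in a rock band",
}
# alias -> list of (entity_id, prior probability)
ALIASES = {"Emerson": [("E1", 0.3), ("E2", 0.4), ("E3", 0.3)]}


def tokens(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def find_mentions(text):
    """Stage 1 (stand-in for NER): return known aliases that occur in the text."""
    return [a for a in ALIASES if re.search(rf"\b{re.escape(a)}\b", text)]


def generate_candidates(mention):
    """Stage 2: look up the candidate entities registered for a mention."""
    return ALIASES.get(mention, [])


def disambiguate(text, candidates):
    """Stage 3: pick the candidate with the highest prior * (1 + context overlap)."""
    context = tokens(text)
    best_id, best_score = None, -1.0
    for entity_id, prior in candidates:
        overlap = len(context & tokens(ENTITIES[entity_id]))
        score = prior * (1 + overlap)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id


def link(text):
    """Run all three stages and map each mention to one entity ID."""
    return {m: disambiguate(text, generate_candidates(m)) for m in find_mentions(text)}


print(link("Emerson was a tennis player from Australia."))
print(link("The poet Emerson wrote essays about nature."))
```

With no useful context, the score falls back to the prior probability alone, which is why the choice of aliases and priors in the knowledge base matters for recall.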
Method
Implement Entity Linking by defining a knowledge base with entity vectors and aliases, annotating training data (e.g., with Prodigy), and training a spaCy "entity_linker" component.
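To make the knowledge base part of this method concrete, here is a minimal sketch of a structure holding fixed-length entity vectors and aliases with prior probabilities. The class ToyKnowledgeBase and its methods are hypothetical stand-ins written for illustration; they are not spaCy's knowledge base API, and the entity IDs and vector values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKnowledgeBase:
    """Hypothetical in-memory knowledge base (illustration only, not spaCy's class)."""

    vector_length: int
    entity_vectors: dict = field(default_factory=dict)
    aliases: dict = field(default_factory=dict)

    def add_entity(self, entity_id, vector):
        # Every entity vector must have the same dimensionality.
        if len(vector) != self.vector_length:
            raise ValueError(f"expected {self.vector_length} values, got {len(vector)}")
        self.entity_vectors[entity_id] = list(vector)

    def add_alias(self, alias, entity_ids, priors):
        # Each alias maps to known entities, one prior probability per entity.
        if len(entity_ids) != len(priors):
            raise ValueError("need exactly one prior per entity")
        unknown = [e for e in entity_ids if e not in self.entity_vectors]
        if unknown:
            raise ValueError(f"unknown entities: {unknown}")
        if sum(priors) > 1.0 + 1e-9:
            raise ValueError("priors for one alias must not sum to more than 1")
        self.aliases[alias] = list(zip(entity_ids, priors))

    def candidates(self, alias):
        # Candidate generation: an unseen alias yields no candidates.
        return self.aliases.get(alias, [])


# Three entities with 300-dimensional vectors, as in the summarized setup.
kb = ToyKnowledgeBase(vector_length=300)
for index, entity_id in enumerate(["E1", "E2", "E3"]):
    kb.add_entity(entity_id, [float(index)] * 300)  # placeholder vectors
kb.add_alias("Emerson", ["E1", "E2", "E3"], [0.3, 0.3, 0.3])
print(len(kb.candidates("Emerson")))
```

In a real setup the vectors would typically come from encoding each entity's description, and the priors from how often an alias refers to each entity in a corpus; both are assumptions here rather than details taken from the summarized article.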
In practice
- Augment relation extraction with gene normalization for biomedical entities.
- Consolidate company names from news for economic landscape graphs.
- Use Prodigy for rapid iteration between annotation and model training.
Topics
- Entity Linking
- spaCy
- Named Entity Recognition
- Knowledge Base
- Machine Learning
- Data Annotation
- Prodigy
Best for: NLP Engineer, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).