Mark Neumann: ScispaCy: A spaCy pipeline & models for scientific & biomedical text
Summary
ScispaCy is a spaCy package for processing scientific and biomedical text, developed at the Allen Institute for AI. It addresses the limitations of general-purpose spaCy models in specialized domains, where they often produce incorrect part-of-speech tags and dependency parses and assign named entity recognition (NER) labels that are irrelevant to the domain. ScispaCy trains its models on extensive biomedical resources, including ontologies, annotated corpora such as MedMentions and CRAFT, and the raw PubMed corpus. The core pipeline includes universal dependency parsing and a generic mention detector, with specialized NER models available for subfields such as cancer genomics. ScispaCy is approximately 30 times faster than traditional tools like MetaMap, and it adds custom components for tasks such as unsupervised abbreviation detection (95% precision and 82% recall) and efficient candidate generation for entity linking.
Key takeaway
For NLP engineers and research scientists processing biomedical text, general-purpose spaCy models fall short because of domain-specific linguistic challenges. ScispaCy offers specialized models, a roughly 30x speed-up over tools like MetaMap, and custom components for abbreviation detection and entity linking. This lets you build robust, efficient systems without extensive infrastructure work, which speeds up both research and application development.
Key insights
ScispaCy provides a fast, domain-specific NLP pipeline for biomedical text, overcoming the limitations of general-purpose spaCy models with specialized models and components.
Principles
- Domain-specific NLP requires tailored models.
- Iteration speed is more valuable than is often assumed.
- Simple heuristics can outperform complex ML.
Method
ScispaCy builds on spaCy's architecture, combining universal dependency parsing, a generic mention detector, and specialized NER models. It uses TF-IDF to generate candidates for entity linking and an unsupervised algorithm for abbreviation detection.
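To make the candidate generation step concrete, here is a minimal sketch of TF-IDF retrieval over ontology aliases, using only the Python standard library. The summary says only that TF-IDF is used; the character-trigram features, the brute-force cosine search, and the names `char_ngrams` and `TfidfCandidateGenerator` are assumptions made for illustration and are not ScispaCy's API.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TfidfCandidateGenerator:
    """Hypothetical helper: ranks known aliases by TF-IDF cosine similarity."""

    def __init__(self, aliases, n=3):
        self.aliases = list(aliases)
        self.n = n
        docs = [Counter(char_ngrams(alias, n)) for alias in self.aliases]
        # Document frequency: in how many aliases each n-gram occurs.
        df = Counter(gram for doc in docs for gram in doc)
        total = len(docs)
        # Smoothed inverse document frequency.
        self.idf = {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}
        self.vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, counts):
        """Build an L2-normalised sparse TF-IDF vector, ignoring unseen n-grams."""
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def candidates(self, mention, k=3):
        """Return up to k (alias, score) pairs with non-zero similarity."""
        query = self._vectorize(Counter(char_ngrams(mention, self.n)))
        scored = []
        for alias, vec in zip(self.aliases, self.vectors):
            score = sum(w * vec.get(g, 0.0) for g, w in query.items())
            scored.append((score, alias))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(alias, round(score, 3)) for score, alias in scored[:k] if score > 0]


if __name__ == "__main__":
    generator = TfidfCandidateGenerator(
        ["myocardial infarction", "heart attack", "breast cancer", "lung carcinoma"]
    )
    print(generator.candidates("myocardial infarct", k=2))
```

Character n-grams tolerate spelling variants and truncated mentions, which is why they are a common choice for this kind of lookup. A real knowledge base has far more aliases than this toy list, so a production system would likely replace the linear scan with an approximate nearest-neighbour index.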
In practice
- Use ScispaCy for biomedical text processing.
- Develop custom components for unique domain needs.
- Combine heuristic and ML methods.
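As an illustration of a heuristic that can sit alongside learned components, the sketch below detects abbreviation definitions of the form "long form (SHORT)" without any training data. It is a simplified take on the well-known Schwartz and Hearst (2003) style of character matching, offered as one plausible unsupervised approach rather than as ScispaCy's implementation; the function names `match_long_form` and `detect_abbreviations` and the regular expression are assumptions.

```python
import re


def match_long_form(short_form, candidate):
    """Align short-form characters right to left against the candidate text.

    Returns the matched long form, or None if the characters cannot be aligned.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while l >= 0 and (
            candidate[l].lower() != ch
            or (s == 0 and l > 0 and candidate[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Expand to the beginning of the word containing the first match.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def detect_abbreviations(text):
    """Return a dict mapping each parenthesised short form to its long form."""
    pairs = {}
    for m in re.finditer(r"\(([A-Za-z][\w-]{1,9})\)", text):
        short = m.group(1)
        words = text[:m.start()].split()
        # Search window: at most min(|short| + 5, 2 * |short|) preceding words.
        window = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-window:])
        long_form = match_long_form(short, candidate)
        if long_form and len(long_form) > len(short):
            pairs[short] = long_form
    return pairs


if __name__ == "__main__":
    sentence = (
        "Patients with chronic obstructive pulmonary disease (COPD) were "
        "treated with angiotensin converting enzyme (ACE) inhibitors."
    )
    print(detect_abbreviations(sentence))
    # {'COPD': 'chronic obstructive pulmonary disease', 'ACE': 'angiotensin converting enzyme'}
```

In a spaCy pipeline, logic like this would typically be wrapped in a custom component so that resolved long forms are available to downstream NER and entity linking.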
Topics
- ScispaCy
- Biomedical NLP
- spaCy Pipelines
- Entity Linking
- Abbreviation Detection
- Information Extraction
Best for: AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).