Mark Neumann: ScispaCy: A spaCy pipeline & models for scientific & biomedical text

Source: Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Software Development & Engineering, Biomedical Natural Language Processing · Depth: Intermediate, long

Summary

ScispaCy is a spaCy package for processing scientific and biomedical text, developed at the Allen Institute for AI (AI2). It addresses the limitations of general-purpose spaCy models in specialized domains, where they tend to produce incorrect part-of-speech tags and dependency parses, along with named entity recognition (NER) labels that are irrelevant to biomedical text. ScispaCy trains its models on extensive biomedical resources, including ontologies, annotated corpora such as MedMentions and CRAFT, and the raw PubMed corpus. The core pipeline includes universal dependency parsing, a generic mention detector, and specialized NER models for subfields such as cancer genomics. ScispaCy is approximately 30 times faster than traditional tools like MetaMap. It also ships custom components for unsupervised abbreviation detection (95% precision and 82% recall) and for efficient candidate generation in entity linking.
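To make the abbreviation detection step concrete, the sketch below shows one way an unsupervised detector can work: find a short form in parentheses, then scan the preceding words right to left for a phrase whose characters spell it out. This is a simplified version of the Schwartz and Hearst heuristic, assumed here to be representative of the kind of algorithm the summary refers to. It is not scispaCy's own code, and the function names are hypothetical.

```python
import re

# A parenthesised span with no nested parentheses, e.g. "(COPD)".
PAREN = re.compile(r"\(([^()]+)\)")


def is_valid_short_form(short_form):
    """Assumed filter: 2-10 characters, at most two tokens, contains a letter."""
    return (
        2 <= len(short_form) <= 10
        and len(short_form.split()) <= 2
        and any(ch.isalpha() for ch in short_form)
        and short_form[0].isalnum()
    )


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    The first character of the short form must start a word in the candidate.
    Returns the matched long form, or None if the characters cannot be aligned.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue
        while (l_index >= 0 and candidate[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def detect_abbreviations(text):
    """Return a dict mapping each detected short form to its long form."""
    found = {}
    for match in PAREN.finditer(text):
        short_form = match.group(1).strip()
        if not is_valid_short_form(short_form):
            continue
        preceding = text[: match.start()].split()
        # Search window in words, as in the original heuristic.
        window = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(preceding[-window:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            found[short_form] = long_form
    return found


if __name__ == "__main__":
    sentence = (
        "Patients with chronic obstructive pulmonary disease (COPD) "
        "were scanned using magnetic resonance imaging (MRI)."
    )
    for short_form, long_form in detect_abbreviations(sentence).items():
        print(short_form, "->", long_form)
```

The heuristic needs no training data, which is what makes it unsupervised. The precision and recall figures quoted above belong to the original system, not to this sketch.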

Key takeaway

For NLP engineers and research scientists processing biomedical text, general-purpose spaCy models fall short because of domain-specific linguistic challenges. Consider adopting scispaCy for its specialized models, its roughly 30x speedup over tools like MetaMap, and its custom components for abbreviation detection and entity linking. This lets you build robust, efficient systems without extensive infrastructure work, accelerating research and application development.

Key insights

ScispaCy provides a fast, domain-specific NLP pipeline for biomedical text, overcoming the limitations of general-purpose spaCy models with specialized models and components.

Principles

Method

ScispaCy builds on spaCy's architecture, combining universal dependency parsing, a generic mention detector, and specialized NER models. It uses TF-IDF to generate candidates for entity linking and an unsupervised algorithm to detect abbreviations.
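TF-IDF candidate generation can be sketched in a few lines: each knowledge-base alias becomes a TF-IDF vector, and a mention is matched to the aliases with the highest cosine similarity. The sketch below is illustrative only. The character-trigram features, the brute-force scoring, the class name, and the toy knowledge base with made-up concept ids are all assumptions, not scispaCy's implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TfidfCandidateGenerator:
    """Brute-force TF-IDF matcher over knowledge-base aliases (hypothetical)."""

    def __init__(self, kb):
        # kb maps a concept id to its list of aliases.
        self.entries = [(cid, alias) for cid, aliases in kb.items() for alias in aliases]
        counts = [Counter(char_ngrams(alias)) for _, alias in self.entries]
        doc_freq = Counter(gram for c in counts for gram in c)
        n_docs = len(counts)
        # Smoothed inverse document frequency.
        self.idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
        self.vectors = [self._vectorize(c) for c in counts]

    def _vectorize(self, counts):
        """Build an L2-normalised sparse TF-IDF vector, dropping unseen n-grams."""
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def candidates(self, mention, k=3):
        """Return up to k (concept id, best alias, cosine score) tuples."""
        query = self._vectorize(Counter(char_ngrams(mention)))
        best = {}
        for (cid, alias), vec in zip(self.entries, self.vectors):
            score = sum(w * vec.get(g, 0.0) for g, w in query.items())
            # Keep only the best-scoring alias per concept; skip zero overlap.
            if score > best.get(cid, (0.0, ""))[0]:
                best[cid] = (score, alias)
        ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
        return [(cid, alias, round(score, 3)) for cid, (score, alias) in ranked[:k]]


# Toy knowledge base with made-up concept ids, for illustration only.
TOY_KB = {
    "concept_cancer": ["malignant neoplasm", "cancer"],
    "concept_mi": ["myocardial infarction", "heart attack"],
    "concept_dm": ["diabetes mellitus", "diabetes"],
}

if __name__ == "__main__":
    generator = TfidfCandidateGenerator(TOY_KB)
    print(generator.candidates("myocardial infarct"))
    print(generator.candidates("diabetis"))
```

Character n-grams make the match tolerant of misspellings and morphological variants, which helps with noisy biomedical mentions. A real system would replace the linear scan with an index that scales to millions of aliases.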

Best for: AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion · Developer tools and consulting for AI, Machine Learning and NLP - Explosion.ai.