AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction
Summary
AAbAAC, an AutoAntibodies and Autoimmunity Annotated Corpus, is introduced to improve information extraction in the specialized biomedical field of autoimmunity. The corpus comprises 115 PubMed abstracts, manually annotated for five entity types ("Autoantibody", "Autoantibody target", "Disease", "Symptom or clinical sign", and "Autoantibody location") and ten relationship types. It was created by building a dictionary of 10,916 autoantibody variations from HPO release 2024-07-01, pre-annotating the texts with GLiNER, and then having four experts annotate them manually in Doccano. An evaluation of named entity recognition (NER) methods, including QuickUMLS, GLiNER, and MedGemma, showed that fine-tuning on AAbAAC significantly improved performance. For instance, fine-tuned MedGemma achieved the best overall F1-score, and the fine-tuned GLiNER large model reached an F1-score of 0.79 for "Disease" detection, supporting the corpus's utility for NLP tasks in specialized domains.
Key takeaway
Machine learning engineers building NLP solutions in specialized biomedical fields should prioritize creating or acquiring small, high-quality annotated corpora. Fine-tuning models such as GLiNER or MedGemma on such domain-specific data, even just 115 abstracts, improves named entity recognition performance over zero-shot or dictionary-based methods. This matters for accurately extracting complex entities such as autoantibodies and diseases, and it makes information extraction in downstream applications more precise and reliable.
Key insights
Small, specialized annotated corpora significantly boost information extraction performance in niche biomedical domains.
Principles
- Domain-specific corpora improve generalist model performance.
- Manual annotation is crucial for specialized entity recognition.
- Inter-annotator agreement highlights annotation complexity.
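To make the agreement principle concrete, here is a minimal sketch of Cohen's kappa computed over token-level labels from two annotators. This summary does not say which agreement measure the authors used, so kappa is an assumption chosen for illustration, and the function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative token labels only ("O" marks a token outside any entity).
annotator_1 = ["O", "Disease", "Disease", "O"]
annotator_2 = ["O", "Disease", "O", "O"]
print(cohen_kappa(annotator_1, annotator_2))
```

A low kappa on a given entity type is one signal that its annotation guidelines are hard to apply consistently.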
Method
Create a domain-specific dictionary, query literature, pre-annotate with generalist models, then manually annotate entities and relationships with expert adjudicators.
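The pre-annotation step can be sketched with a simple dictionary matcher. This is not the authors' pipeline (they pre-annotated with GLiNER); it only illustrates how a term dictionary can produce candidate spans for experts to correct. The dictionary entries, the function name `pre_annotate`, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical mini-dictionary mapping surface forms to entity types.
# The dictionary described in the paper is far larger; these entries are
# illustrative only.
TERM_DICT = {
    "anti-dsDNA antibodies": "Autoantibody",
    "anti-dsDNA": "Autoantibody",
    "systemic lupus erythematosus": "Disease",
}


def pre_annotate(text, term_dict):
    """Return sorted, non-overlapping (start, end, label) spans.

    Matching is case-insensitive and longer matches win over shorter ones.
    """
    candidates = []
    for term, label in term_dict.items():
        # Require non-word characters (or string edges) around the term.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            candidates.append((match.start(), match.end(), label))
    # Consider the longest candidates first, then the leftmost.
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    chosen = []
    for start, end, label in candidates:
        # Keep a candidate only if it overlaps no span already chosen.
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)


sample = "Anti-dsDNA antibodies are a hallmark of systemic lupus erythematosus."
for start, end, label in pre_annotate(sample, TERM_DICT):
    print(sample[start:end], "->", label)
```

Spans produced this way are only suggestions; in the workflow described above, expert annotators still review and correct every abstract.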
In practice
- Fine-tune GLiNER or MedGemma models with domain data.
- Prioritize "Disease" and "Autoantibody" entity types for higher F1-scores.
- Consider removing rarely used relation types from annotation schemes.
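When comparing models per entity type, a small scoring helper is useful. The sketch below computes exact-match precision, recall, and F1 for each label. This summary does not state whether the paper scored exact or relaxed span matches, so exact matching is an assumption here, and the function name `f1_by_type` and the example spans are hypothetical.

```python
from collections import defaultdict


def f1_by_type(gold, pred):
    """Score predictions per entity type using exact span matching.

    gold and pred are sets of (doc_id, start, end, label) tuples.
    """
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for span in pred:
        # A prediction counts as correct only if the identical span is in gold.
        stats[span[3]]["tp" if span in gold else "fp"] += 1
    for span in gold:
        if span not in pred:
            stats[span[3]]["fn"] += 1
    scores = {}
    for label, c in stats.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        precision = c["tp"] / predicted if predicted else 0.0
        recall = c["tp"] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Illustrative spans only, not data from the corpus.
gold = {(1, 0, 21, "Autoantibody"), (1, 40, 68, "Disease"), (2, 5, 10, "Disease")}
pred = {(1, 0, 21, "Autoantibody"), (1, 40, 68, "Disease"), (2, 5, 12, "Disease")}
for label, score in sorted(f1_by_type(gold, pred).items()):
    print(label, score)
```

Running the same helper on zero-shot and fine-tuned predictions gives a per-type view of where fine-tuning helps most.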
Topics
- Autoimmunity
- Named Entity Recognition
- Biomedical NLP
- Annotated Corpora
- GLiNER
- MedGemma
Best for: NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.