AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

2024-07-01 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Biomedical Natural Language Processing · Depth: Advanced, extended

Summary

AAbAAC, the AutoAntibodies and Autoimmunity Annotated Corpus, is introduced to improve information extraction in the specialized biomedical field of autoimmunity. It comprises 115 PubMed abstracts manually annotated for five entity types ("Autoantibody", "Autoantibody target", "Disease", "Symptom or clinical sign", and "Autoantibody location") and ten relationship types. The authors built a dictionary of 10,916 autoantibody variations from HPO release 2024-07-01, pre-annotated the texts with GLiNER, and had four experts annotate them manually in Doccano. An evaluation of named entity recognition (NER) methods, including QuickUMLS, GLiNER, and MedGemma, showed that fine-tuning on AAbAAC significantly improved performance. Fine-tuned MedGemma achieved the best overall F1-score, and fine-tuned GLiNER large reached an F1-score of 0.79 for "Disease" detection, supporting the corpus's utility for NLP tasks in specialized domains.

Key takeaway

Machine learning engineers building NLP solutions in specialized biomedical fields should prioritize creating or acquiring small, high-quality annotated corpora. Fine-tuning models such as GLiNER or MedGemma on domain-specific data, even as few as 115 abstracts, improved named entity recognition performance over zero-shot and dictionary-based methods in this evaluation. That gain matters most for complex entities such as autoantibodies and diseases, where more precise extraction makes downstream applications more reliable.

Key insights

Small, specialized annotated corpora significantly boost information extraction performance in niche biomedical domains.

Principles

Domain-specific corpora improve generalist model performance.
Manual annotation is crucial for specialized entity recognition.
Inter-annotator agreement highlights annotation complexity.
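
As an illustration of the last principle, agreement between two annotators can be quantified with a chance-corrected statistic. The sketch below computes Cohen's kappa over token-level labels; the choice of kappa and the example labels are assumptions made for illustration, since this summary does not state which agreement measure the authors used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels from two annotators ("O" = no entity).
annotator_1 = ["Autoantibody", "O", "Disease", "Disease", "O", "Symptom or clinical sign"]
annotator_2 = ["Autoantibody", "O", "Disease", "O", "O", "Disease"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A low kappa on a particular entity type is a hint that the annotation guideline for that type needs tightening before more text is labelled.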

Method

Create a domain-specific dictionary, query literature, pre-annotate with generalist models, then manually annotate entities and relationships with expert adjudicators.
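
The dictionary and pre-annotation steps can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: a plain regular-expression matcher stands in for the generalist model (the paper pre-annotated with GLiNER), the three term variations are illustrative stand-ins for the 10,916-entry dictionary, and the record layout (`text` plus a `label` list of `[start, end, label]` spans) is an assumption about the JSONL format Doccano imports for sequence labelling.

```python
import json
import re


def build_matcher(variations):
    """Compile one case-insensitive pattern from dictionary variations."""
    # Longest variations first, so "anti-dsDNA antibodies" wins over "anti-dsDNA".
    ordered = sorted(set(variations), key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def pre_annotate(text, matcher, label="Autoantibody"):
    """Return a record with character-offset spans for every dictionary hit."""
    spans = [[m.start(), m.end(), label] for m in matcher.finditer(text)]
    return {"text": text, "label": spans}


# Illustrative variations only; the real dictionary is derived from HPO.
variations = ["anti-dsDNA", "anti-dsDNA antibodies", "ANA"]
matcher = build_matcher(variations)

abstract = "Anti-dsDNA antibodies and ANA were detected in patients with lupus."
record = pre_annotate(abstract, matcher)
print(json.dumps(record))  # one JSON object per line gives a JSONL file
```

Pre-annotations like these are only a starting point: the experts still correct, add, and remove spans during manual annotation.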

In practice

Fine-tune GLiNER or MedGemma models with domain data.
Prioritize "Disease" and "Autoantibody" entity types for higher F1-scores.
Consider removing rarely used relation types from annotation schemes.
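
To decide which entity types are worth prioritizing, it helps to score a model per type rather than only overall. The sketch below computes precision, recall, and F1 per entity type using exact span matching; the matching criterion and the example spans are assumptions, as this summary does not specify how the paper scored predictions.

```python
from collections import defaultdict


def per_type_f1(gold, predicted):
    """Score spans per entity type.

    gold and predicted are sets of (doc_id, start, end, label) tuples;
    a prediction counts as correct only if all four fields match.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for span in predicted:
        if span in gold:
            tp[span[3]] += 1
        else:
            fp[span[3]] += 1
    for span in gold:
        if span not in predicted:
            fn[span[3]] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        found = tp[label] + fp[label]
        expected = tp[label] + fn[label]
        precision = tp[label] / found if found else 0.0
        recall = tp[label] / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical gold and predicted spans for a single document.
gold = {(1, 0, 21, "Autoantibody"), (1, 26, 29, "Autoantibody"), (1, 60, 65, "Disease")}
predicted = {(1, 0, 21, "Autoantibody"), (1, 60, 65, "Disease"), (1, 40, 48, "Disease")}

scores = per_type_f1(gold, predicted)
for label, s in sorted(scores.items()):
    print(f"{label}: P={s['precision']:.2f} R={s['recall']:.2f} F1={s['f1']:.2f}")
```

Running the same scoring before and after fine-tuning shows which entity types actually benefit from the domain corpus.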

Topics

Autoimmunity
Named Entity Recognition
Biomedical NLP
Annotated Corpora
GLiNER
MedGemma

Best for: NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.