AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics, Biomedical Natural Language Processing) · Depth: Advanced, extended

Summary

AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus) is a corpus built to support information extraction in the specialized biomedical field of autoimmunity. It comprises 115 PubMed abstracts, manually annotated for five entity types ("Autoantibody", "Autoantibody target", "Disease", "Symptom or clinical sign", and "Autoantibody location") and ten relationship types. To build it, the authors compiled a dictionary of 10,916 autoantibody variations from HPO release 2024-07-01, pre-annotated the texts with GLiNER, and then had four experts annotate them manually in Doccano. An evaluation of named entity recognition (NER) methods, including QuickUMLS, GLiNER, and MedGemma, showed that fine-tuning on AAbAAC improved performance. Fine-tuned MedGemma achieved the best overall F1-score, and fine-tuned GLiNER large reached an F1-score of 0.79 for "Disease" detection, supporting the corpus's utility for NLP tasks in specialized domains.
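The F1-scores quoted above are per-entity-type NER metrics. As a minimal sketch of how such scores can be computed, the function below counts a prediction as correct only when its start offset, end offset, and label all match a gold annotation. This summary does not state which matching criterion the paper used, so strict matching is an assumption here, and the function name `entity_f1` and the `(start, end, label)` tuple layout are illustrative.

```python
from collections import Counter


def entity_f1(gold, pred):
    """Per-label precision, recall and F1 under exact span-and-label matching.

    gold, pred: lists with one entry per document; each entry is a set of
    (start, end, label) tuples. Strict matching is an assumption, not
    necessarily the criterion used in the paper.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_spans, pred_spans in zip(gold, pred):
        for span in pred_spans:
            # A predicted span is a true positive only if it is in the gold set.
            (tp if span in gold_spans else fp)[span[2]] += 1
        for span in gold_spans - pred_spans:
            # Gold spans that were never predicted are false negatives.
            fn[span[2]] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy example: the Disease span matches exactly; the Autoantibody span
# has a wrong end offset, so it counts as one false positive and one
# false negative.
gold = [{(0, 8, "Disease"), (20, 30, "Autoantibody")}]
pred = [{(0, 8, "Disease"), (20, 28, "Autoantibody")}]
scores = entity_f1(gold, pred)
```

Strict matching penalizes boundary disagreements heavily, which matters for long, variable mentions such as autoantibody names; a relaxed (overlap-based) criterion would give higher numbers on the same predictions.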

Key takeaway

Machine learning engineers building NLP solutions in specialized biomedical fields should prioritize creating or acquiring small, high-quality annotated corpora. Fine-tuning models such as GLiNER or MedGemma on domain-specific data, even as few as 115 abstracts, improves named entity recognition over zero-shot or dictionary-based methods. This matters most for complex entities such as autoantibodies and diseases, where it yields more precise and reliable information extraction.

Key insights

Small, specialized annotated corpora significantly boost information extraction performance in niche biomedical domains.

Principles

Method

Create a domain-specific dictionary, query literature, pre-annotate with generalist models, then manually annotate entities and relationships with expert adjudicators.
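The dictionary and pre-annotation steps can be illustrated with a minimal sketch, assuming a plain mapping from surface forms to entity labels. The paper pre-annotated with GLiNER; the regex matcher below is a simpler stand-in, not the authors' pipeline. The helper names (`build_matcher`, `pre_annotate`) and the three-entry dictionary are hypothetical, and the `text`/`label` field names in the JSONL record are an assumption about Doccano's sequence-labeling import format that should be checked against the Doccano version in use.

```python
import json
import re


def build_matcher(dictionary):
    """Compile one case-insensitive regex over all dictionary variants.

    dictionary: dict mapping surface form -> entity label (hypothetical layout).
    Longer variants are tried first so they win over their own substrings.
    """
    if not dictionary:
        raise ValueError("dictionary must contain at least one variant")
    variants = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(v) for v in variants) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {variant.lower(): label for variant, label in dictionary.items()}
    return pattern, lookup


def pre_annotate(text, pattern, lookup):
    """Return non-overlapping [start, end, label] spans found in text."""
    return [
        [m.start(1), m.end(1), lookup[m.group(1).lower()]]
        for m in pattern.finditer(text)
    ]


# Tiny illustrative dictionary; the real one has 10,916 autoantibody variations.
dictionary = {
    "anti-dsDNA": "Autoantibody",
    "anti-dsDNA antibodies": "Autoantibody",
    "systemic lupus erythematosus": "Disease",
}
pattern, lookup = build_matcher(dictionary)

text = "Anti-dsDNA antibodies are associated with systemic lupus erythematosus."
spans = pre_annotate(text, pattern, lookup)

# One JSONL record per abstract, ready for review by human annotators.
record = json.dumps({"text": text, "label": spans})
```

Pre-annotations like these are only a starting point: annotators still need to correct boundaries, add missed mentions, and annotate relationships by hand.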

Best for: NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.