BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels
Summary
BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning) is a new framework for biomedical information retrieval that leverages hierarchical MeSH annotations to provide structured supervision. It addresses limitations of existing generative retrievers that rely on coarse binary relevance signals, which struggle to capture fine-grained semantic overlap in biomedical texts. BioHiCL adapts general-domain dense retrievers using parameter-efficient LoRA fine-tuning on 80K abstracts from the BioASQ Task 1a benchmark, incorporating depth-aware label similarity. The framework introduces two models, BioHiCL-Base (0.1B parameters) and BioHiCL-Large (0.3B parameters), which demonstrate strong performance across biomedical retrieval, sentence similarity, and question answering tasks, including benchmarks like NFCorpus, BIOSSES, and PubMedQA. These models also maintain computational efficiency, running effectively on a single A100 40GB GPU.
Key takeaway
For NLP engineers building biomedical information retrieval systems, BioHiCL offers an efficient option. Consider BioHiCL-Base or BioHiCL-Large when you need to capture fine-grained semantic relationships through hierarchical MeSH labels. Both models are reported to perform strongly on document retrieval, sentence similarity, and question answering, and both remain practical to run on a single A100 40GB GPU.
Key insights
Hierarchical MeSH labels provide fine-grained supervision for biomedical dense retrieval, improving semantic representation.
Principles
- Align embedding similarity with depth-weighted MeSH label similarity.
- Combine regression and contrastive losses to prevent embedding collapse.
- LoRA fine-tuning efficiently adapts general-domain retrievers.
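The first principle can be illustrated with a minimal sketch of a depth-weighted label similarity. The formula used here (a Jaccard overlap over each label's ancestor set, weighted by tree depth) is an assumption chosen for illustration and is not confirmed to be the definition BioHiCL uses. The function names and the tree-number strings are likewise hypothetical.

```python
def ancestors(tree_number):
    """Return a tree number plus all of its ancestors.

    'C04.588.180' -> {'C04', 'C04.588', 'C04.588.180'}
    """
    parts = tree_number.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def depth(tree_number):
    """Depth of a node in the hierarchy; a top-level node has depth 1."""
    return tree_number.count(".") + 1


def label_similarity(labels_a, labels_b):
    """Depth-weighted Jaccard overlap between two label sets (assumed formula).

    Each label is expanded to its ancestor chain, and every node contributes
    a weight equal to its depth, so shared specific concepts count for more
    than shared broad categories.
    """
    nodes_a = set().union(*(ancestors(t) for t in labels_a))
    nodes_b = set().union(*(ancestors(t) for t in labels_b))
    union = nodes_a | nodes_b
    if not union:
        return 0.0
    shared = nodes_a & nodes_b
    return sum(depth(n) for n in shared) / sum(depth(n) for n in union)


# Illustrative tree-number strings, not taken from the paper.
siblings = label_similarity(["C04.588.180"], ["C04.588.894"])  # share two levels
cousins = label_similarity(["C04.588.180"], ["C04.557"])       # share only the root
```

In this sketch the sibling pair scores higher than the pair that shares only a top-level category, which is the behavior a depth-aware similarity is meant to produce.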
Method
BioHiCL applies LoRA fine-tuning to align embedding similarity (SimE) with depth-weighted MeSH label similarity (SimL). Training combines a filtered regression objective with a hierarchy-aware contrastive loss, with the hierarchy-aware weighting emphasizing more specific MeSH concepts.
In practice
- Use BioHiCL-Base (0.1B) for efficient, strong biomedical retrieval.
- Consider BioHiCL-Large (0.3B) for enhanced sentence similarity.
- Apply MeSH-based supervision for fine-grained semantic modeling.
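For the retrieval use case, a dense retriever is typically applied by embedding the query and the documents and then ranking documents by cosine similarity. The sketch below assumes the embeddings have already been computed by some encoder. The toy vectors, document IDs, and function names are hypothetical, and no BioHiCL-specific API is implied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest cosine similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder embeddings; in practice these would come from the encoder.
toy_docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.0, 0.0]
ranking = rank_documents(toy_query, toy_docs)
ranked_ids = [doc_id for doc_id, _ in ranking]
```

For large corpora, an approximate nearest-neighbor index would usually replace this exhaustive scan.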
Topics
- Hierarchical Contrastive Learning
- Biomedical Information Retrieval
- MeSH Labels
- Dense Retrievers
- LoRA Fine-Tuning
Best for: NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.