BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, extended

Summary

BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning) is a framework for biomedical information retrieval that uses hierarchical MeSH annotations as structured supervision. It targets a limitation of existing generative retrievers: they rely on coarse binary relevance signals, which struggle to capture fine-grained semantic overlap between biomedical texts. BioHiCL adapts general-domain dense retrievers with parameter-efficient LoRA fine-tuning on 80K abstracts from the BioASQ Task 1a benchmark, using a depth-aware label similarity as the training signal. The framework yields two models, BioHiCL-Base (0.1B parameters) and BioHiCL-Large (0.3B parameters). Both are reported to perform strongly on biomedical retrieval, sentence similarity, and question answering, on benchmarks including NFCorpus, BIOSSES, and PubMedQA. Both run on a single A100 40GB GPU.

Key takeaway

For NLP engineers building biomedical information retrieval systems, BioHiCL-Base and BioHiCL-Large are worth evaluating. Both use hierarchical MeSH labels to capture fine-grained semantic relationships, report strong results on document retrieval, sentence similarity, and question answering, and remain practical to run on a single A100 GPU.

Key insights

Hierarchical MeSH labels provide fine-grained supervision for biomedical dense retrieval, improving semantic representation.

Principles

Align embedding similarity with depth-weighted MeSH label similarity.
Combine regression and contrastive losses to prevent embedding collapse.
LoRA fine-tuning efficiently adapts general-domain retrievers.
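
The first principle can be illustrated with a small sketch. This summary does not give the paper's exact formula for depth-weighted label similarity, so the version below is an assumption: each MeSH tree number is expanded to include its ancestors, each node is weighted by its depth, and similarity is a weighted Jaccard overlap. The function names and the tree-number strings are illustrative, not taken from the paper.

```python
# Sketch only: an assumed depth-weighted Jaccard over MeSH tree numbers.
# The paper's actual SimL definition may differ.

def depth(tree_number: str) -> int:
    """Depth of a node, counted as the number of dot-separated segments."""
    return tree_number.count(".") + 1


def expand_with_ancestors(labels):
    """Add every ancestor prefix of each tree number to the label set."""
    expanded = set()
    for tree_number in labels:
        parts = tree_number.split(".")
        for i in range(1, len(parts) + 1):
            expanded.add(".".join(parts[:i]))
    return expanded


def label_similarity(labels_a, labels_b) -> float:
    """Weighted Jaccard overlap in which deeper nodes count for more."""
    a = expand_with_ancestors(labels_a)
    b = expand_with_ancestors(labels_b)
    union = a | b
    if not union:
        return 0.0
    shared = a & b
    return sum(depth(t) for t in shared) / sum(depth(t) for t in union)


# Illustrative tree-number-style strings (hypothetical annotations).
doc_1 = {"C04.557.470"}
doc_2 = {"C04.557.337"}  # shares two ancestor levels with doc_1
doc_3 = {"C04.100"}      # shares only the top-level category with doc_1

print(label_similarity(doc_1, doc_2))
print(label_similarity(doc_1, doc_3))
```

Under this assumed weighting, two documents that share a deep branch score higher than two that share only a top-level category, which is the behavior the principle describes.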

Method

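BioHiCL fine-tunes a general-domain dense retriever with LoRA so that embedding similarity (SimE) tracks depth-weighted MeSH label similarity (SimL). Training combines two objectives: a filtered regression objective that fits SimE to SimL, and a hierarchy-aware contrastive loss. The depth weighting places more emphasis on specific concepts than on broad ones.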

In practice

Use BioHiCL-Base (0.1B) for efficient, strong biomedical retrieval.
Consider BioHiCL-Large (0.3B) for enhanced sentence similarity.
Apply MeSH-based supervision for fine-grained semantic modeling.
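
At retrieval time, a dense retriever of this kind is typically used by embedding the query and documents and ranking by cosine similarity. The sketch below shows only that ranking step. The vectors are hand-written stand-ins rather than real model outputs, and the `rank` helper and document IDs are hypothetical; loading a BioHiCL checkpoint is left out because this summary does not name one.

```python
import math

# Sketch only: ranking by cosine similarity over stand-in embeddings.

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest cosine similarity first."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Stand-in embeddings; a real system would obtain these from the encoder.
doc_vecs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
    "doc_c": [0.0, 1.0],
}
query_vec = [1.0, 0.0]

for doc_id, score in rank(query_vec, doc_vecs):
    print(doc_id, round(score, 3))
```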

Topics

Hierarchical Contrastive Learning
Biomedical Information Retrieval
MeSH Labels
Dense Retrievers
LoRA Fine-Tuning

Best for: NLP Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.