PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature
Summary
PubMed-Ophtha is a new open-access dataset designed for training vision-language models in ophthalmology, comprising 102,023 image-caption pairs derived from 15,842 open-access articles in PubMed Central. Figures are extracted directly from the article PDFs at full resolution and decomposed into individual panels, panel identifiers, and images. Each image is annotated with its imaging modality, such as color fundus photography or optical coherence tomography, and with a "mark status" indicating whether it contains annotations such as arrows. A two-step LLM approach was used to split figure captions into panel-level subcaptions, achieving a mean average sentence BLEU score of 0.913. The dataset generation pipeline includes panel and image detection models with mAP@0.50 scores of 0.909 and 0.892, respectively, and figure extraction with a median IoU of 0.997. The creators also provide human-annotated ground-truth data, trained models, and the full generation pipeline to ensure reproducibility.
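To make the figure-extraction metric concrete, the following is a minimal sketch of how intersection over union (IoU) between a predicted and a ground-truth bounding box can be computed, and how a median IoU is obtained over several figures. The box format `(x0, y0, x1, y1)` and the example coordinates are assumptions for illustration; they are not taken from the PubMed-Ophtha pipeline.

```python
from statistics import median


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical (predicted, ground-truth) figure boxes in pixels.
pairs = [
    ((0, 0, 100, 100), (0, 0, 100, 100)),  # exact match
    ((0, 0, 100, 100), (0, 0, 100, 50)),   # prediction twice as tall as the truth
    ((0, 0, 10, 10), (20, 20, 30, 30)),    # no overlap
]
scores = [iou(pred, truth) for pred, truth in pairs]
median_iou = median(scores)
```

A median close to 1.0, as reported for figure extraction, means that at least half of the extracted figure boxes almost exactly coincide with their ground-truth boxes.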
Key takeaway
For computer vision engineers building ophthalmology vision-language models, PubMed-Ophtha helps address the scarcity of ophthalmic image-text data. Its panel-level images, subcaptions, modality labels, and mark-status labels can be integrated directly into training pipelines. The released generation pipeline also serves as a blueprint for creating similar domain-specific datasets.
Key insights
PubMed-Ophtha provides 102,023 ophthalmology image-caption pairs, drawn from 15,842 open-access articles, for vision-language model development.
Principles
- High-resolution image extraction is crucial.
- Panel-level captioning improves granularity.
- Reproducibility requires open data and models.
Method
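Figures are extracted from article PDFs at full resolution, decomposed into panels and images, and annotated with imaging modality and mark status. A two-step LLM approach splits each figure caption into panel-level subcaptions, achieving a mean average sentence BLEU score of 0.913.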
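As an illustration of how a generated subcaption can be scored against a human-written reference, here is a simplified sentence-level BLEU sketch. It assumes lowercase whitespace tokenization, n-grams up to length 4, and add-one smoothing; the exact BLEU configuration used by the authors is not stated in this summary, so the scores from this sketch are not comparable to the reported 0.913. The example captions are invented.

```python
import math
from collections import Counter
from statistics import mean


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence BLEU with add-one smoothing and a brevity penalty."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it
        # appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


# Invented reference subcaption and two candidate splits.
reference = "Color fundus photograph of the right eye showing optic disc edema."
exact = "Color fundus photograph of the right eye showing optic disc edema."
partial = "Fundus photograph of the left eye."

exact_score = sentence_bleu(reference, exact)
partial_score = sentence_bleu(reference, partial)
mean_score = mean([exact_score, partial_score])
```

Averaging the per-subcaption scores over an annotated evaluation set gives a single summary number of the kind reported for the caption-splitting step.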
In practice
- Utilize PubMed-Ophtha for ophthalmology VLMs.
- Adopt panel-level captioning for medical images.
- Release pipelines for dataset reproducibility.
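The points above can be sketched as a small selection step over panel-level records, for example to build a training subset of a single imaging modality without annotation marks. The record layout, the field names (`image_path`, `subcaption`, `modality`, `has_marks`), and the sample values are hypothetical; the released dataset may organize this information differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageCaptionPair:
    """Hypothetical panel-level record; field names are illustrative only."""
    image_path: str
    subcaption: str
    modality: str
    has_marks: bool  # True if the image contains annotations such as arrows


def select(pairs, modality, allow_marks=False):
    """Keep records of one modality, optionally excluding marked images."""
    return [
        p for p in pairs
        if p.modality == modality and (allow_marks or not p.has_marks)
    ]


# Invented sample records.
records = [
    ImageCaptionPair("fig1_a.png", "Fundus image of the left eye.", "color fundus photography", False),
    ImageCaptionPair("fig1_b.png", "OCT scan through the fovea.", "optical coherence tomography", False),
    ImageCaptionPair("fig2_a.png", "Fundus image; arrow marks a hemorrhage.", "color fundus photography", True),
]

unmarked_fundus = select(records, "color fundus photography")
all_fundus = select(records, "color fundus photography", allow_marks=True)
```

Filtering on mark status is useful when overlaid arrows or labels would otherwise act as shortcuts during training.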
Topics
- PubMed-Ophtha
- Vision-Language Models
- Ophthalmology Datasets
- Scientific Literature Extraction
- Large Language Models
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.