PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature

2026-05-04 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

PubMed-Ophtha is a new open-access dataset for training vision-language models in ophthalmology, comprising 102,023 image-caption pairs derived from 15,842 open-access articles in PubMed Central. Figures are extracted directly from the article PDFs at full resolution and decomposed into individual panels, identifiers, and images. Each image is annotated with its imaging modality, such as color fundus photography or optical coherence tomography, and with a "mark status" indicating whether it contains overlaid annotations such as arrows. A two-step LLM approach splits figure captions into panel-level subcaptions, achieving a mean average sentence BLEU score of 0.913. In the dataset generation pipeline, the panel and image detection models reach mAP@0.50 scores of 0.909 and 0.892, respectively, and figure extraction reaches a median IoU of 0.997. The creators also release human-annotated ground-truth data, trained models, and the full generation pipeline to support reproducibility.
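The figure-extraction result above is reported as a median intersection-over-union (IoU). As a minimal sketch of how such a number can be computed, the snippet below scores predicted boxes against ground-truth boxes and takes the median. The box format `(x_min, y_min, x_max, y_max)` and the example coordinates are assumptions for illustration, not values from the dataset.

```python
from statistics import median


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical (predicted, ground-truth) figure boxes in pixel coordinates.
pairs = [
    ((0, 0, 100, 100), (0, 0, 100, 100)),  # exact match
    ((0, 0, 100, 100), (0, 0, 100, 50)),   # prediction twice as tall
    ((0, 0, 10, 10), (20, 20, 30, 30)),    # no overlap
]

ious = [box_iou(pred, truth) for pred, truth in pairs]
median_iou = median(ious)
print(ious, median_iou)
```

A median close to 1, like the reported 0.997, means that at least half of the extracted figure boxes almost exactly coincide with their reference boxes.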

Key takeaway

For computer vision engineers building ophthalmology vision-language models, PubMed-Ophtha helps address the scarcity of domain-specific image-text training data. Its modality labels, mark-status labels, and panel-level subcaptions can be used to select and pair training examples. The released generation pipeline can also serve as a template for building similar domain-specific datasets.

Key insights

PubMed-Ophtha provides 102,023 ophthalmology image-caption pairs, extracted at full resolution from 15,842 open-access PubMed Central articles, for vision-language model development.

Principles

High-resolution image extraction is crucial.
Panel-level captioning improves granularity.
Reproducibility requires open data and models.

Method

Figures are extracted from PDFs, decomposed into panels, and annotated with modality and mark status. A two-step LLM approach splits each figure caption into panel-level subcaptions, reaching a mean average sentence BLEU score of 0.913.
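To make the caption-splitting metric concrete, here is a minimal sketch of a sentence-level BLEU score averaged over panel subcaptions. The whitespace tokenization, add-one smoothing, and example captions are all assumptions; the summary does not state which BLEU variant the authors used, so this will not necessarily reproduce the reported value.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence BLEU with add-one smoothing (an assumption)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        log_precisions.append(math.log((clipped + 1) / (total + 1)))

    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical (human-annotated, LLM-generated) panel subcaption pairs.
subcaption_pairs = [
    ("Color fundus photograph of the right eye.",
     "Color fundus photograph of the right eye."),
    ("OCT scan showing macular edema (arrow).",
     "OCT scan showing macular edema."),
]

scores = [sentence_bleu(ref, hyp) for ref, hyp in subcaption_pairs]
mean_bleu = sum(scores) / len(scores)
print(scores, mean_bleu)
```

An exact reproduction of the reference subcaption scores 1.0, while dropped or altered words lower the score, so a mean near 0.9 suggests the generated subcaptions stay close to the human-annotated splits.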

In practice

Use PubMed-Ophtha to train ophthalmology VLMs.
Adopt panel-level captioning for medical images.
Release pipelines for dataset reproducibility.
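As one way to picture how the panel-level annotations could feed a training pipeline, the sketch below models each image as a record with a modality, a mark status, and a subcaption, then selects a subset. The class name, field names, label strings, and placeholder records are hypothetical; the dataset's actual schema is not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelImage:
    """Hypothetical record for one extracted image (field names are assumed)."""
    figure_id: str
    panel_label: str
    modality: str      # e.g. "color fundus photography"
    has_marks: bool    # True if arrows or other overlays are present
    subcaption: str


def select(records, modality, allow_marks=False):
    """Keep records of one modality, optionally excluding marked images."""
    return [
        r for r in records
        if r.modality == modality and (allow_marks or not r.has_marks)
    ]


# Placeholder records for illustration only, not real dataset entries.
records = [
    PanelImage("fig1", "A", "color fundus photography", False, "placeholder A"),
    PanelImage("fig1", "B", "optical coherence tomography", True, "placeholder B"),
    PanelImage("fig2", "A", "color fundus photography", True, "placeholder C"),
]

clean_fundus = select(records, "color fundus photography")
all_fundus = select(records, "color fundus photography", allow_marks=True)
print(len(clean_fundus), len(all_fundus))
```

Filtering on mark status is useful when overlaid arrows or labels might act as shortcuts for a model, while keeping marked images may help when the subcaption refers to them explicitly.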

Topics

PubMed-Ophtha
Vision-Language Models
Ophthalmology Datasets
Scientific Literature Extraction
Large Language Models

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.