EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP-OCT Pretraining
Summary
EyeMVP is a cross-modal retinal foundation model that enriches color fundus photography (CFP) representations with depth-resolved information from optical coherence tomography (OCT). It was pretrained on 674,893 same-eye, same-day paired CFP-OCT image triples from 112,642 patients across eight hospitals in China. The model is trained with cross-modal masked reconstruction, using source-constrained cross-attention and CFP-derived structural masks to accommodate the non-aligned imaging geometries of the two modalities. Because EyeMVP requires only CFP images at inference, it is suited to screening settings where OCT is unavailable. Across 16 downstream tasks, including classification and segmentation, EyeMVP consistently outperformed other retinal foundation models, particularly on tasks involving macular and optic nerve structure. On macular diseases that are difficult to detect from CFP alone, it achieved an AUROC of 0.948 for macular edema, significantly higher than the 0.852 achieved by EyeCLIP, and an AUROC of 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP surpassed junior and intermediate ophthalmologists on macular edema and showed numerically higher balanced accuracy than all reader groups on myopic macular schisis.
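The two metrics quoted in the summary, AUROC and balanced accuracy, can be illustrated with a minimal standard-library sketch. The labels, scores, and predictions below are toy values made up for illustration and are not data from the study; the function names are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels, predictions):
    """Mean of sensitivity and specificity for binary labels (0 or 1)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy example (illustrative values only, not from the paper).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_predictions = [1, 0, 0, 0]
print(auroc(toy_labels, toy_scores))                  # 0.75
print(balanced_accuracy(toy_labels, toy_predictions)) # 0.75
```

AUROC is threshold-free, whereas balanced accuracy depends on a fixed decision threshold, which is one reason a reader study comparing discrete clinician decisions would report the latter.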
Key takeaway
For AI scientists developing diagnostic tools for ophthalmology, EyeMVP demonstrates a way to strengthen unimodal screening: injecting OCT-informed representations into a CFP model during pretraining improves diagnostic accuracy for complex macular diseases such as macular edema and myopic macular schisis. Consider exploring cross-modal masked reconstruction and source-constrained cross-attention to improve existing CFP-based models, particularly for conditions whose diagnosis depends on depth information. This offers a practical route to more accurate retinal analysis in screening settings.
Key insights
EyeMVP uses paired CFP-OCT pretraining to enrich CFP representations with OCT depth information for improved retinal analysis.
Principles
- Pixel-level cross-modal reconstruction enriches CFP with OCT supervision.
- Cross-attention with structural masks accommodates non-aligned imaging.
- Paired CFP-OCT pretraining improves performance on macular/optic nerve tasks.
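The first principle can be made concrete with a minimal sketch of a masked reconstruction loss. This assumes an MAE-style objective in which the target-modality (here, OCT) patches are partly hidden and the loss is a mean squared error over the hidden patches only. The summary does not state EyeMVP's actual loss, so the function name, the choice of MSE, and the patch layout are all illustrative assumptions.

```python
def masked_reconstruction_loss(predicted, target, masked):
    """Mean squared pixel error over masked patches only.

    predicted, target: lists of patches, each patch a list of pixel values.
    masked: list of booleans, True where the patch was hidden from the model.
    Assumption: an MAE-style objective, not the paper's confirmed loss.
    """
    total, count = 0.0, 0
    for pred_patch, true_patch, is_masked in zip(predicted, target, masked):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred_patch, true_patch))
        count += len(true_patch)
    if count == 0:
        raise ValueError("at least one patch must be masked")
    return total / count


# Toy usage: two 2-pixel patches, only the first one is masked.
loss = masked_reconstruction_loss(
    predicted=[[1.0, 1.0], [0.0, 0.0]],
    target=[[0.0, 0.0], [5.0, 5.0]],
    masked=[True, False],
)
print(loss)  # 1.0
```

Restricting the loss to hidden patches means the model is rewarded only for inferring content it could not see, which in a cross-modal setup would have to come from the other modality's tokens.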
Method
EyeMVP uses cross-modal masked reconstruction with source-constrained cross-attention and CFP-derived structural masks, pretrained on paired CFP-OCT image triples. It requires only CFP for inference.
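One way to picture cross-attention that is restricted by a mask is the single-head sketch below. It assumes that "source-constrained" means each target-side query may attend only to a subset of source (CFP) tokens, and that this subset is given by a boolean mask standing in for the CFP-derived structural mask. Both readings are assumptions, since the summary does not specify the mechanism, and all names here are hypothetical.

```python
import math


def masked_cross_attention(queries, keys, values, allowed):
    """Single-head scaled dot-product cross-attention with a boolean mask.

    queries: target-side token vectors (e.g. OCT tokens, by assumption).
    keys, values: source-side token vectors (e.g. CFP tokens, by assumption).
    allowed[i][j]: True if query i may attend to source token j
                   (a hypothetical stand-in for a structural mask).
    """
    dim = len(queries[0])
    outputs = []
    for q, allow_row in zip(queries, allowed):
        # Disallowed source tokens get a score of -inf, so weight 0.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            if ok else float("-inf")
            for k, ok in zip(keys, allow_row)
        ]
        peak = max(scores)
        if peak == float("-inf"):
            raise ValueError("each query needs at least one allowed source token")
        # Numerically stable softmax over the allowed tokens.
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy usage: one query, three source tokens, the third is masked out.
out = masked_cross_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]],
    allowed=[[True, True, False]],
)
print(out)
```

Because the masked token receives zero weight, its large value vector has no effect on the output, which is the behavior a constraint on the attended source region is meant to guarantee.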
In practice
- Enhance CFP-based retinal analysis in screening settings.
- Improve diagnosis of CFP-challenging macular diseases.
- Potentially aid junior ophthalmologists in specific diagnoses.
Topics
- EyeMVP
- Cross-modal Learning
- Retinal Screening
- Optical Coherence Tomography
- Color Fundus Photography
- Macular Disease Diagnosis
Best for: Computer Vision Engineer, AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.