EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP-OCT Pretraining

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Medical Imaging AI · Depth: Expert, quick

Summary

EyeMVP is a cross-modal retinal foundation model that enriches color fundus photography (CFP) representations with depth-resolved information from optical coherence tomography (OCT). It was pretrained on 674,893 same-eye, same-day paired CFP-OCT image triples from 112,642 patients across eight hospitals in China. Pretraining uses cross-modal masked reconstruction, with source-constrained cross-attention and CFP-derived structural masks relating the two modalities despite their non-aligned imaging geometries. At inference EyeMVP needs only CFP images, which makes it suitable for screening. Across 16 downstream tasks spanning classification and segmentation, it consistently outperformed other retinal foundation models, particularly on tasks involving macular and optic nerve structure. On macular diseases that are hard to detect from CFP, it achieved an AUROC of 0.948 for macular edema, compared with 0.852 for EyeCLIP, and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP surpassed junior and intermediate ophthalmologists on macular edema and showed numerically higher balanced accuracy than all reader groups on myopic macular schisis.
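The two metrics quoted above can be made concrete with a minimal sketch. AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count half), and balanced accuracy is the mean of sensitivity and specificity. The function names, the 0.5 threshold, and the toy labels and scores below are illustrative assumptions, not code or data from the EyeMVP study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # A correctly ordered pair counts 1, a tie counts 0.5, a misordered pair 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels, preds):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("balanced accuracy needs both classes present")
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy data (made up for illustration): 1 = disease present, 0 = absent.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed 0.5 threshold

toy_auroc = auroc(toy_labels, toy_scores)
toy_bal_acc = balanced_accuracy(toy_labels, toy_preds)
```

AUROC is threshold-free, whereas balanced accuracy depends on the chosen operating point. That is one reason a model-versus-reader comparison is usually reported with a threshold-based metric, since human readers give decisions rather than continuous scores.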

Key takeaway

For AI scientists building diagnostic tools for ophthalmology, EyeMVP shows how to strengthen CFP-only screening: pretrain with paired OCT so the CFP encoder learns OCT-informed features, then run inference on CFP alone. The approach improves diagnostic accuracy for macular diseases that are hard to read from CFP, such as macular edema and myopic macular schisis. Cross-modal masked reconstruction and source-constrained cross-attention are worth exploring as ways to improve existing CFP-based models, particularly for conditions that depend on depth information. This offers a practical route to more accurate retinal analysis in screening settings.

Key insights

EyeMVP uses paired CFP-OCT pretraining to enrich CFP representations with OCT depth information for improved retinal analysis.

Principles

Method

EyeMVP uses cross-modal masked reconstruction with source-constrained cross-attention and CFP-derived structural masks, pretrained on paired CFP-OCT image triples. It requires only CFP for inference.
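To make the idea concrete, here is a minimal NumPy sketch of reconstructing one modality from another through restricted cross-attention. It is not the authors' implementation. The source does not spell out how "source-constrained" is defined, so the sketch assumes it means that each masked OCT query may attend only to a permitted subset of CFP tokens, supplied here as a hypothetical boolean mask. The function name, tensor shapes, random tokens, and mean-squared-error loss are all assumptions for illustration.

```python
import numpy as np


def masked_cross_attention(queries, keys, values, allowed):
    """Scaled dot-product cross-attention restricted by a boolean mask.

    queries: (n_q, d) tokens of the modality being reconstructed (e.g. masked OCT patches)
    keys, values: (n_k, d) tokens of the source modality (e.g. CFP patches)
    allowed: (n_q, n_k) bool, True where a query may attend to a source token
    """
    if not allowed.any(axis=-1).all():
        raise ValueError("every query needs at least one allowed source token")
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Disallowed positions get -inf so they receive exactly zero weight.
    scores = np.where(allowed, scores, -np.inf)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
cfp_tokens = rng.normal(size=(6, 8))   # 6 CFP patch embeddings, dimension 8
oct_queries = rng.normal(size=(3, 8))  # 3 masked OCT positions to reconstruct
oct_targets = rng.normal(size=(3, 8))  # stand-in for the true OCT patch content

# Hypothetical structural mask: only CFP patches 1, 2 and 4 may be used as sources.
allowed = np.zeros((3, 6), dtype=bool)
allowed[:, [1, 2, 4]] = True

recon, weights = masked_cross_attention(oct_queries, cfp_tokens, cfp_tokens, allowed)
recon_loss = float(np.mean((recon - oct_targets) ** 2))  # assumed MSE objective
```

In a real model the queries, keys, and values would come from learned projections and the loss would drive encoder updates. The sketch only shows how a mask can confine which source tokens contribute to each reconstructed position.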

Topics

Best for: Computer Vision Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.