Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

2026-06-12 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

Neuro-JEPA is a sparse multimodal neuroimaging foundation model designed to learn unified representations across MRI contrasts. It combines a latent predictive objective with a Mixture-of-Experts architecture to encode T1w, T2w, and FLAIR imaging. The model was pretrained on 1,551,862 scans from 428,647 studies, and its architectural, masking, objective, and sparsity design choices were examined in a systematic methodological study. Evaluated on 25 tasks from three health systems (NYU Langone, NYU Long Island, Massachusetts General Hospital) and 22 tasks from 12 public datasets, Neuro-JEPA consistently outperformed both a simple convolutional neural network baseline and existing neuroimaging foundation models. These results position it as a scalable framework for multimodal neuroimaging representation learning.

Key takeaway

For AI scientists and machine learning engineers building neuroimaging models, Neuro-JEPA's architecture is a useful reference design for multimodal representation learning. Its consistent performance across clinical and public datasets suggests that pairing a latent predictive objective with a Mixture-of-Experts architecture can outperform both a simple CNN baseline and existing neuroimaging foundation models. When validating a foundation model, include simple baselines and clinically heterogeneous cohorts in the evaluation protocol.

Key insights

Neuro-JEPA is a sparse multimodal neuroimaging foundation model using a latent predictive objective and Mixture-of-Experts.
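
To make the term "latent predictive objective" concrete, here is a minimal sketch of the general JEPA-style recipe: encode the visible patches, predict the latents of the masked patches, and measure the error in latent space rather than in pixel space. This is an illustration under stated assumptions, not Neuro-JEPA's implementation. The toy linear encoder, the mean-pooled context, the squared-error loss, the EMA target encoder, and every name (encode, latent_predictive_loss, ema_update, predict) are hypothetical and chosen only to keep the example in pure Python.

```python
def encode(patch, weights):
    """Toy linear encoder (assumption): one latent value per weight row."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def latent_predictive_loss(patches, mask, ctx_w, tgt_w, predict):
    """Mean squared error between predicted and target latents.

    Only masked patches (mask[i] is True) contribute to the loss.
    `predict(context, position)` is a hypothetical predictor callable.
    """
    visible = [encode(p, ctx_w) for p, m in zip(patches, mask) if not m]
    if not visible or all(not m for m in mask):
        raise ValueError("need at least one visible and one masked patch")

    # Pool the visible latents into one context vector (simplifying assumption).
    context = [sum(col) / len(visible) for col in zip(*visible)]

    losses = []
    for position, (patch, masked) in enumerate(zip(patches, mask)):
        if masked:
            # Target latent; in real training no gradient flows through it.
            target = encode(patch, tgt_w)
            pred = predict(context, position)
            sq_err = sum((a - b) ** 2 for a, b in zip(pred, target))
            losses.append(sq_err / len(target))
    return sum(losses) / len(losses)


def ema_update(tgt_w, ctx_w, momentum=0.99):
    """Move target-encoder weights toward the context encoder.

    An exponential moving average is a common JEPA choice (assumption here).
    """
    return [
        [momentum * t + (1.0 - momentum) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(tgt_w, ctx_w)
    ]
```

Calling latent_predictive_loss with a list of patch vectors, a boolean mask, two weight matrices, and a predictor returns a single scalar. A predictor that reproduces the target latents exactly gives a loss of zero.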

Principles

Study architectural, masking, objective, and sparsity design choices systematically.
Foundation model evaluation needs simple baselines and heterogeneous cohorts.

Method

Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies, encoding T1w, T2w, and FLAIR imaging after modality-specific preprocessing.
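
The summary does not say how the Mixture-of-Experts layer routes inputs, so the following is a generic sketch of top-k routing, the usual mechanism that makes such a layer sparse: a router scores every expert, only the k best experts run, and their outputs are mixed using renormalised weights. The function names (softmax, top_k_route, moe_forward), the default k=2, and the toy experts in the usage note are all assumptions made for illustration, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts.

    Returns [(expert_index, weight)], with weights renormalised to sum to 1.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs (sparse activation)."""
    routes = top_k_route(router_logits, k)
    output = [0.0] * len(token)
    for index, weight in routes:
        expert_out = experts[index](token)  # unselected experts are never called
        output = [o + weight * v for o, v in zip(output, expert_out)]
    return output, routes
```

As a usage example, experts can be any list of callables that map a vector to a vector of the same length, such as `lambda x: [2 * v for v in x]`. With k smaller than the number of experts, the compute per token stays roughly constant as more experts are added, which is the usual motivation for sparse designs.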

In practice

Encode brain MRI across T1w, T2w, and FLAIR sequences.
Learn unified representations at health-system scale.

Topics

Neuroimaging
Foundation Models
Multimodal Learning
Mixture-of-Experts
Latent Predictive Models
MRI

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.