ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection
Summary
ViBE (Visual-to-M/EEG Brain Encoding) is a two-stage framework that generates magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli, addressing limitations in existing brain encoding models and visual prostheses. In the first stage, a spatio-temporal convolutional variational autoencoder (TSC-VAE) learns to reconstruct M/EEG signals by capturing their hierarchical spatio-temporal characteristics, reaching a reconstruction Pearson correlation of 0.941 on THINGS-EEG2 and 0.981 on THINGS-MEG. In the second stage, a Q-Former maps CLIP image embeddings into the TSC-VAE's latent space, producing neural proxy embeddings. This mapping is trained with a combined loss: mean squared error (MSE) for point-wise feature matching and sliced Wasserstein distance (SWD) for probability distribution alignment, which together bridge the large modality gap between visual and neural representations. For image-to-signal generation, ViBE achieves Pearson correlations of 0.635 on THINGS-EEG2 and 0.543 on THINGS-MEG, outperforming previous methods.
Key takeaway
For machine learning engineers building brain-computer interfaces or visual prostheses, ViBE offers a two-stage recipe for converting visual stimuli into M/EEG signals: first, train a high-fidelity autoencoder for neural signal reconstruction; second, add a cross-modal mapping component that explicitly aligns both feature scale and distribution between visual and neural representations. Consider using TSConvPlus and a combined MSE/SWD loss to improve signal fidelity and cross-modal generalization.
Key insights
ViBE generates high-fidelity M/EEG signals from visual stimuli by aligning visual and neural representations through a two-stage VAE and Q-Former framework.
Principles
- Hierarchical spatio-temporal convolutions improve M/EEG signal reconstruction.
- Bridging modality gaps requires both feature and distribution alignment.
- Occipital and temporal cortices are critical for visual encoding.
Method
ViBE first trains a TSC-VAE to reconstruct M/EEG signals; a Q-Former then maps CLIP image embeddings into the TSC-VAE latent space. The mapping is trained with a combined MSE and sliced Wasserstein distance (SWD) loss.
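To make the combined objective concrete, here is a minimal sketch of an MSE plus SWD loss over two batches of embeddings. It is not the authors' implementation: the function names, the number of random projections, the equal batch sizes, and the `swd_weight` parameter are all assumptions made for illustration, and it uses only the Python standard library rather than a tensor framework.

```python
import math
import random


def mse(pred, target):
    """Mean squared error over paired embeddings (point-wise matching)."""
    n_values = sum(len(row) for row in pred)
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ) / n_values


def sliced_wasserstein(pred, target, n_proj=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    Both inputs are equal-size lists of vectors. Each random unit direction
    reduces the problem to 1-D, where optimal transport is just sorting.
    """
    if len(pred) != len(target):
        raise ValueError("this sketch assumes equal batch sizes")
    rng = random.Random(seed)
    dim = len(pred[0])
    total = 0.0
    for _ in range(n_proj):
        # Draw a random direction on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both sets onto the direction and sort the projections.
        proj_pred = sorted(sum(a * b for a, b in zip(row, direction)) for row in pred)
        proj_target = sorted(sum(a * b for a, b in zip(row, direction)) for row in target)
        total += sum((a - b) ** 2 for a, b in zip(proj_pred, proj_target)) / len(proj_pred)
    return total / n_proj


def alignment_loss(pred, target, swd_weight=1.0):
    """Hypothetical combined loss: MSE plus a weighted SWD term."""
    return mse(pred, target) + swd_weight * sliced_wasserstein(pred, target)


# Toy batches: the same points in a different order, and a shifted copy.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
permuted = list(reversed(target))
shifted = [[value + 1.0 for value in row] for row in target]

# Same set of points, so the sorted projections match and SWD is 0.0,
# while MSE still penalizes the broken pairing.
print(mse(permuted, target), sliced_wasserstein(permuted, target))
# A shifted batch differs in distribution, so both terms are positive.
print(mse(shifted, target), sliced_wasserstein(shifted, target))
```

The toy batches show why the two terms complement each other: MSE depends on which prediction is paired with which target, whereas SWD only compares the two batches as distributions.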
In practice
- Use TSConvPlus with a spatial kernel size $k_s < C$ (where $C$ is presumably the number of M/EEG channels) for M/EEG processing.
- Combine MSE and SWD for robust cross-modal embedding alignment.
- Focus on occipital/temporal channels for visual pathway encoding.
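The first recommendation can be illustrated by looking only at output shapes. The sketch below is not TSConvPlus itself, whose layer layout is not described here. It is a plain "valid" spatial convolution along the channel axis, assuming $C$ denotes the number of channels and that neighbouring list entries are meaningful neighbours. The function name and toy sizes are hypothetical.

```python
def spatial_conv(signal, kernel):
    """Valid cross-correlation along the channel axis.

    signal: list of C channels, each a list of T samples.
    kernel: list of k_s weights applied across neighbouring channels.
    Returns C - k_s + 1 rows of T samples each.
    """
    n_channels = len(signal)
    k_s = len(kernel)
    if k_s > n_channels:
        raise ValueError("spatial kernel cannot exceed the channel count")
    n_times = len(signal[0])
    return [
        [sum(kernel[j] * signal[i + j][t] for j in range(k_s)) for t in range(n_times)]
        for i in range(n_channels - k_s + 1)
    ]


# Toy recording: C = 8 channels, T = 5 samples, channel c holds the value c.
n_channels, n_times = 8, 5
signal = [[float(c)] * n_times for c in range(n_channels)]

local = spatial_conv(signal, [1.0] * 3)               # k_s = 3 < C
collapsed = spatial_conv(signal, [1.0] * n_channels)  # k_s = C

print(len(local), len(local[0]))          # 6 rows of 5 samples remain
print(len(collapsed), len(collapsed[0]))  # 1 row of 5 samples remains
```

With $k_s < C$ the output keeps a spatial axis that later layers can process further, whereas $k_s = C$ mixes all channels into a single row in one step.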
Topics
- Brain Encoding
- Visual Prostheses
- Spatio-Temporal VAE
- Q-Former
- Cross-Modal Alignment
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.