ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection

2026-04-30 · Source: cs.CV updates on arXiv.org · Field: Science & Research — Life Sciences & Biology, Health & Medical Research, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

ViBE (Visual-to-M/EEG Brain Encoding) is a two-stage framework that generates magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli, addressing limitations in existing brain encoding models and visual prostheses. In the first stage, a spatio-temporal convolutional variational autoencoder (TSC-VAE) learns to reconstruct M/EEG signals by capturing their hierarchical spatio-temporal characteristics, reaching a reconstruction Pearson correlation of 0.941 on THINGS-EEG2 and 0.981 on THINGS-MEG. In the second stage, a Q-Former maps CLIP image embeddings into the TSC-VAE's latent space, producing neural proxy embeddings. The mapping is trained with a combined loss: mean squared error (MSE) for point-wise feature matching and sliced Wasserstein distance (SWD) for aligning the probability distributions, which together bridge the modality gap between visual and neural representations. On image-to-signal generation, ViBE achieves Pearson correlations of 0.635 on THINGS-EEG2 and 0.543 on THINGS-MEG, outperforming previous methods.
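The Pearson correlations quoted above compare a generated or reconstructed signal with the recorded one. The summary does not say how the paper averages over channels, trials, or subjects, so the following is only a minimal sketch that assumes a per-channel correlation over time followed by a plain mean. The function name `pearson_per_channel` and the toy data are illustrative, not taken from the paper.

```python
import numpy as np

def pearson_per_channel(true, pred):
    """Pearson r between two (channels, time) arrays, one value per channel."""
    t = true - true.mean(axis=1, keepdims=True)   # centre each channel
    p = pred - pred.mean(axis=1, keepdims=True)
    num = (t * p).sum(axis=1)
    den = np.sqrt((t ** 2).sum(axis=1) * (p ** 2).sum(axis=1))
    return num / den

rng = np.random.default_rng(0)
signal = rng.standard_normal((4, 250))                  # toy "recorded" epoch
noisy = signal + 0.3 * rng.standard_normal((4, 250))    # toy "generated" epoch

r = pearson_per_channel(signal, noisy)   # shape (4,)
mean_r = float(r.mean())                 # assumed aggregation: mean over channels
```

Because Pearson correlation is invariant to scale and offset, a high value says the waveform shape matches, not that amplitudes agree.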

Key takeaway

For machine learning engineers developing brain-computer interfaces or visual prostheses, ViBE offers a framework for converting visual stimuli into M/EEG signals. The design suggests a two-stage approach: first, a high-fidelity autoencoder for neural signal reconstruction; second, a cross-modal mapping component that explicitly addresses differences in feature scale and distribution between visual and neural representations. Consider using the TSConvPlus spatio-temporal convolution module and a combined MSE/SWD loss to improve signal fidelity and cross-modal generalization.

Key insights

ViBE generates high-fidelity M/EEG signals from visual stimuli by aligning visual and neural representations through a two-stage VAE and Q-Former framework.

Principles

Hierarchical spatio-temporal convolutions improve M/EEG signal reconstruction.
Bridging modality gaps requires both feature and distribution alignment.
Occipital and temporal cortices are critical for visual encoding.
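To make the first principle concrete, here is a rough sketch of what a spatio-temporal convolution over an M/EEG epoch looks like: a temporal filter applied to every channel, followed by a spatial filter that slides over a window of channels smaller than the full channel count. This is a generic illustration in plain NumPy, not the authors' TSConvPlus module, whose internals the summary does not describe. The kernel sizes, the fixed (unlearned) weights, and the assumption that neighbouring rows correspond to neighbouring sensors are all illustrative.

```python
import numpy as np

def temporal_conv(x, kernel):
    """Filter each channel along time ('valid' mode). x: (C, T), kernel: (k_t,)."""
    return np.stack([np.convolve(ch, kernel, mode="valid") for ch in x])

def spatial_conv(x, kernel):
    """Slide a kernel over the channel axis. x: (C, T), kernel: (k_s,) with k_s < C."""
    k_s, C = kernel.shape[0], x.shape[0]
    # Each output row mixes k_s adjacent channels at every time step.
    return np.stack([kernel @ x[i:i + k_s] for i in range(C - k_s + 1)])

rng = np.random.default_rng(0)
epoch = rng.standard_normal((8, 100))   # toy epoch: 8 channels, 100 samples

k_t = np.ones(5) / 5                    # illustrative temporal kernel (moving average)
k_s = np.array([0.25, 0.5, 0.25])       # illustrative spatial kernel, k_s = 3 < C = 8

out = spatial_conv(temporal_conv(epoch, k_t), k_s)   # shape (6, 96)
```

With a spatial kernel narrower than the channel count, the output keeps a channel-like axis, so further layers can build progressively wider spatial features, which is one way to read "hierarchical" here.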

Method

ViBE first trains a TSC-VAE to reconstruct M/EEG signals; a Q-Former then maps CLIP image embeddings into the TSC-VAE latent space. The two representations are aligned with a combined MSE and sliced Wasserstein distance (SWD) loss.
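The difference between the two loss terms can be sketched in a few lines. MSE compares each predicted embedding with its paired target, while SWD compares the two sets as distributions by projecting them onto random directions and matching sorted values. The code below is a minimal NumPy sketch under stated assumptions: equal-size batches, a squared 2-Wasserstein variant, and a hypothetical weight `lam`. The summary does not give the paper's exact formulation or weighting.

```python
import numpy as np

def mse(pred, target):
    """Point-wise error between paired embeddings of shape (N, D)."""
    return float(np.mean((pred - target) ** 2))

def sliced_wasserstein(pred, target, n_proj=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance
    between two equal-size sets of embeddings, each of shape (N, D)."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((pred.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)   # unit directions
    p = np.sort(pred @ dirs, axis=0)      # sorted 1-D projections per direction
    t = np.sort(target @ dirs, axis=0)
    return float(np.mean((p - t) ** 2))

def alignment_loss(pred, target, lam=1.0):
    # lam is a hypothetical weighting, not a value from the paper.
    return mse(pred, target) + lam * sliced_wasserstein(pred, target)

rng = np.random.default_rng(1)
target = rng.standard_normal((128, 16))
shuffled = target[rng.permutation(128)]   # same distribution, wrong pairing
shifted = target + 2.0                    # right pairing order, shifted distribution

mse_shuffled = mse(shuffled, target)
swd_shuffled = sliced_wasserstein(shuffled, target)
swd_shifted = sliced_wasserstein(shifted, target)
```

In the toy data, the shuffled set has a large MSE but near-zero SWD, since it is the same point cloud, whereas the shifted set is penalized by both terms. That is the intuition for combining them: one term enforces per-sample matching, the other keeps the overall embedding distribution in the region the decoder was trained on.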

In practice

Use TSConvPlus with spatial kernel size $k_s < C$ for M/EEG processing ($C$ presumably denotes the number of sensor channels; the summary does not define it).
Combine MSE and SWD for robust cross-modal embedding alignment.
Focus on occipital/temporal channels for visual pathway encoding.
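As one way to act on the last point, channels over occipital and temporal regions can be selected by name before training or evaluation. The sketch below assumes EEG channels labelled in 10-20 style, where `O`, `PO`, and `T` prefixes mark occipital, parieto-occipital, and temporal sites. The channel list, the prefix set, and the function name are hypothetical, MEG systems use different naming schemes, and the summary does not say how the paper itself selects or weights channels.

```python
import numpy as np

# Hypothetical 10-20 style channel list; real montages differ.
ch_names = ["Fp1", "Fz", "Cz", "Pz", "T7", "T8",
            "P7", "PO3", "PO4", "O1", "Oz", "O2"]

# Assumed prefixes for occipital, parieto-occipital, and temporal sites.
VISUAL_PREFIXES = ("O", "PO", "T")

def visual_channel_indices(names, prefixes=VISUAL_PREFIXES):
    """Return indices of channels whose name starts with one of the prefixes."""
    return [i for i, n in enumerate(names) if n.upper().startswith(prefixes)]

idx = visual_channel_indices(ch_names)

epochs = np.zeros((3, len(ch_names), 50))   # toy data: (trials, channels, time)
subset = epochs[:, idx, :]                  # keep only the selected channels
```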

Topics

Brain Encoding
Visual Prostheses
Spatio-Temporal VAE
Q-Former
Cross-Modal Alignment

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.