Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance
Summary
A systematic study investigated how the geographic composition of pretraining data impacts geospatial foundation model performance, a factor often overlooked in favor of architectural or modality differences. Researchers created global and per-continent pretraining datasets, evaluating them on corresponding global and local downstream tasks using SatMAE, a Masked Autoencoder-based foundation model. Surprisingly, the Europe-only pretraining dataset consistently outperformed global and other continent-specific datasets across various global and local evaluations, including FMoW, MOSAIKS population density, ForTy, and GEO-Bench tasks. This performance advantage, ranging from 10 to 21 metric points, persisted even with extensive finetuning. Further analysis revealed that per-sample spectral entropy was strongly correlated with downstream performance ($\rho=0.84$, $p=0.002$), while diversity across continents, biomes, and landcover types showed only weak correlations. The study open-sourced seven new pretraining datasets, pretrained models, and an experimental framework.
Key takeaway
Computer vision engineers developing geospatial foundation models should prioritize per-sample spectral diversity (measured in the study as spectral entropy) when constructing pretraining datasets, as it correlates strongly with downstream performance. Do not assume that globally distributed or geographically aligned pretraining data will yield optimal results; instead, empirically compare different data compositions. The choice of pretraining data significantly influences model performance even with large finetuning datasets, so principled dataset design is critical for robust generalization.
Key insights
The geographic composition of pretraining data significantly impacts geospatial foundation model performance, and per-sample spectral diversity is the measure most strongly correlated with that performance.
Principles
- Geographic alignment does not guarantee optimal performance.
- Spectral entropy correlates strongly with downstream performance.
- Pretraining impact persists even with large finetuning datasets.
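To make the spectral entropy principle concrete, here is a minimal sketch of one way a per-sample score could be computed. The exact formula used in the study is not given in this summary, so the definition below (mean Shannon entropy of per-band intensity histograms) is an assumption for illustration, and the function name `spectral_entropy` and the `bins` parameter are hypothetical.

```python
import numpy as np


def spectral_entropy(image: np.ndarray, bins: int = 32) -> float:
    """Mean Shannon entropy (in bits) of the per-band intensity histograms.

    image: array of shape (bands, height, width).
    ASSUMPTION: this definition is illustrative only; the study's exact
    spectral entropy formula is not stated in this summary.
    """
    entropies = []
    for band in image:
        lo, hi = float(band.min()), float(band.max())
        if hi == lo:
            # A constant band carries no information.
            entropies.append(0.0)
            continue
        counts, _ = np.histogram(band, bins=bins, range=(lo, hi))
        probs = counts[counts > 0] / counts.sum()
        entropies.append(float(-(probs * np.log2(probs)).sum()))
    return float(np.mean(entropies))


def dataset_mean_entropy(samples) -> float:
    """Average the per-sample score over an iterable of images."""
    return float(np.mean([spectral_entropy(s) for s in samples]))
```

Under this definition a uniform-colored tile scores 0 and the score is bounded above by `log2(bins)`, so candidate pretraining pools can be compared on a common scale before any model is trained.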
Method
The study involved creating spatially varied pretraining datasets (global, continent-specific), pretraining SatMAE, finetuning on global and local downstream tasks, and analyzing performance correlations with diversity measures.
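The final analysis step, relating a diversity measure to downstream performance, can be sketched as a rank correlation across pretraining datasets. This summary does not state which correlation coefficient the reported value refers to, so Spearman's rank correlation is assumed here; the helper names `rank` and `spearman_rho` are hypothetical, and the numbers in the usage lines are invented placeholders, not results from the study.

```python
import numpy as np


def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_rho(x, y) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])


# Placeholder values, invented for illustration only (NOT from the study):
# one entry per pretraining dataset.
mean_entropy = [3.1, 3.4, 2.9, 3.8, 3.6]
downstream_score = [61.0, 64.5, 58.2, 70.1, 66.3]
print(f"rho = {spearman_rho(mean_entropy, downstream_score):.2f}")
```

A rank-based coefficient is a reasonable choice when only a handful of datasets are compared, since it depends on the ordering of scores rather than on their scale.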
In practice
- Prioritize spectral diversity in geospatial pretraining data.
- Evaluate pretraining datasets beyond geographic matching.
- Consider Europe-only pretraining data as a strong baseline, since it outperformed the global and other continent-specific datasets in this study.
Topics
- Geospatial Foundation Models
- Pretraining Data Diversity
- Spectral Entropy
- SatMAE Architecture
- Remote Sensing
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.