Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

2026-04-24 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Geospatial Data Analysis · Depth: Expert, extended

Summary

A systematic study investigated how the geographic composition of pretraining data affects geospatial foundation model performance, a factor often overlooked in favor of architectural or modality differences. The researchers built global and per-continent pretraining datasets, pretrained SatMAE (a Masked Autoencoder-based foundation model) on each, and evaluated the resulting models on corresponding global and local downstream tasks. Surprisingly, the Europe-only pretraining dataset consistently outperformed the global and other continent-specific datasets across global and local evaluations, including FMoW, MOSAIKS population density, ForTy, and GEO-Bench tasks. This advantage, ranging from 10 to 21 metric points, persisted even with extensive finetuning. Further analysis found that per-sample spectral entropy was strongly correlated with downstream performance ($\rho=0.84$, $p=0.002$), while diversity across continents, biomes, and landcover types showed only weak correlations. The authors open-sourced 7 new pretraining datasets, the pretrained models, and an experimental framework.

Key takeaway

If you are a computer vision engineer developing geospatial foundation models, prioritize per-sample spectral diversity when constructing pretraining datasets, since it correlates strongly with downstream performance. Do not assume that globally distributed or geographically aligned pretraining data will yield the best results; instead, evaluate different data compositions empirically. The choice of pretraining data significantly influences model performance even when large finetuning datasets are available, which makes principled dataset design a critical factor for robust generalization.

Key insights

Geographic pretraining data composition significantly impacts geospatial foundation model performance, with spectral diversity being a key driver.

Principles

Geographic alignment does not guarantee optimal performance.
Spectral entropy correlates strongly with downstream performance.
Pretraining impact persists even with large finetuning datasets.
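
To make the spectral entropy principle concrete, here is a minimal sketch of one way to score a single sample. The definition used (Shannon entropy of a per-band pixel histogram, averaged over bands) is an assumption for illustration; the summary does not give the paper's exact formula, and the function names, bin count, and value range are all hypothetical.

```python
import math
from collections import Counter


def band_entropy(values, n_bins=32, lo=0.0, hi=1.0):
    """Shannon entropy (bits) of one band's pixel values, histogrammed into n_bins.

    Assumption: pixel values are already scaled to [lo, hi].
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        # Clamp to the valid range, then map to a bin index in [0, n_bins - 1].
        idx = int((min(max(v, lo), hi) - lo) / width)
        counts[min(idx, n_bins - 1)] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def spectral_entropy(sample, n_bins=32, lo=0.0, hi=1.0):
    """Mean per-band entropy of one sample.

    `sample` is a list of bands, each a flat list of pixel values.
    Averaging over bands is an illustrative choice, not the paper's stated method.
    """
    if not sample:
        raise ValueError("sample must contain at least one band")
    return sum(band_entropy(band, n_bins, lo, hi) for band in sample) / len(sample)


# A spectrally flat sample versus one whose pixels split evenly between two bins.
flat = [[0.5] * 4, [0.5] * 4]
mixed = [[0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
print(spectral_entropy(flat), spectral_entropy(mixed))
```

Under this definition a sample whose bands are constant scores zero, and the score rises as pixel values spread over more histogram bins. A dataset-level figure could then be taken as the mean score over samples, though that aggregation is also an assumption.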

Method

The study involved creating spatially varied pretraining datasets (global, continent-specific), pretraining SatMAE, finetuning on global and local downstream tasks, and analyzing performance correlations with diversity measures.
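
The final step, relating a diversity measure to downstream performance, can be sketched as a rank correlation. This assumes the reported $\rho$ is a Spearman coefficient, which the summary does not state. The helper names are hypothetical, and the numbers are made-up placeholders, not results from the study.

```python
import math


def average_ranks(xs):
    """1-based ranks of xs, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Placeholder values only: one diversity score and one downstream metric
# per pretraining dataset.
diversity = [1.0, 2.0, 3.0, 4.0]
downstream = [10.0, 20.0, 15.0, 40.0]
print(round(spearman_rho(diversity, downstream), 3))
```

A rank correlation only asks whether datasets with higher diversity scores tend to rank higher on the downstream metric, so it is not sensitive to the scale of either quantity. This sketch does not compute the p-value.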

In practice

Prioritize spectral diversity in geospatial pretraining data.
Evaluate pretraining datasets beyond geographic matching.
Consider Europe-centric data for strong baseline performance.

Topics

Geospatial Foundation Models
Pretraining Data Diversity
Spectral Entropy
SatMAE Architecture
Remote Sensing

Code references

kerner-lab/pretrain-where

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.