Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

Source: cs.CV updates on arXiv.org · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Data Science & Analytics, Geospatial Data Analysis · Depth: Expert, extended

Summary

A systematic study investigated how the geographic composition of pretraining data affects geospatial foundation model performance, a factor often overlooked in favor of architectural or modality differences. The researchers built global and per-continent pretraining datasets, pretrained SatMAE (a Masked Autoencoder-based foundation model) on each, and evaluated the resulting models on corresponding global and local downstream tasks. Surprisingly, the Europe-only pretraining dataset consistently outperformed the global and other continent-specific datasets across global and local evaluations, including FMoW, MOSAIKS population density, ForTy, and GEO-Bench tasks. This advantage, ranging from 10 to 21 metric points, persisted even with extensive finetuning. Further analysis showed that per-sample spectral entropy was strongly correlated with downstream performance ($\rho = 0.84$, $p = 0.002$), while diversity across continents, biomes, and landcover types showed only weak correlations. The authors open-sourced seven new pretraining datasets, the pretrained models, and an experimental framework.

Key takeaway

If you are a computer vision engineer developing geospatial foundation models, prioritize per-sample spectral diversity when constructing pretraining datasets, since it correlated strongly with downstream performance in this study. Do not assume that globally distributed or geographically aligned pretraining data will yield the best results; compare several data compositions empirically instead. The choice of pretraining data significantly influences model performance even with large finetuning datasets, which makes principled dataset design a critical factor for robust model generalization.
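
As a rough illustration of what scoring a sample by spectral entropy could look like, the sketch below computes a Shannon entropy per band and averages across bands. This summary does not give the paper's exact definition, so the histogram-based formulation, the function names `band_entropy` and `sample_spectral_entropy`, the bin count, and the assumption that pixel values are scaled to [0, 1] are all illustrative choices rather than the authors' method.

```python
import math
from collections import Counter


def band_entropy(values, bins=16, lo=0.0, hi=1.0):
    """Shannon entropy (in bits) of one band's pixel values.

    Values are histogrammed into `bins` equal-width bins over [lo, hi];
    out-of-range values are clamped into the first or last bin.
    This is an assumed definition, not the one used in the paper.
    """
    if not values:
        raise ValueError("band has no pixels")
    if hi <= lo or bins < 1:
        raise ValueError("need hi > lo and bins >= 1")
    width = (hi - lo) / bins
    counts = Counter(
        min(bins - 1, max(0, int((v - lo) / width))) for v in values
    )
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sample_spectral_entropy(bands, bins=16, lo=0.0, hi=1.0):
    """Mean per-band entropy for one image.

    `bands` is a list of flattened bands, each a list of floats.
    """
    if not bands:
        raise ValueError("sample has no bands")
    return sum(band_entropy(b, bins, lo, hi) for b in bands) / len(bands)
```

Under this assumed definition, a band with a single constant value scores 0 bits and a band spread evenly over all bins scores `log2(bins)` bits, so higher scores indicate more varied spectral content within a sample. A dataset-level figure could then be the mean of `sample_spectral_entropy` over all samples.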

Key insights

Geographic pretraining data composition significantly impacts geospatial foundation model performance, and per-sample spectral entropy was the diversity measure most strongly correlated with downstream results.

Method

The study built spatially varied pretraining datasets (global and continent-specific), pretrained SatMAE on each, finetuned the resulting models on global and local downstream tasks, and measured how downstream performance correlated with several dataset diversity measures.
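
The correlation step can be sketched as a rank correlation between a per-dataset diversity score and a per-dataset downstream metric. The summary reports $\rho$ without naming the coefficient, so treating it as a Spearman rank correlation is an assumption here, and the helper names below are hypothetical. The sketch uses only the standard library and does not compute a p-value.

```python
import math


def average_ranks(xs):
    """1-based ranks of xs, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

A call takes one score per pretraining dataset and one downstream metric per dataset, for example `spearman_rho(entropy_scores, downstream_scores)`, where both lists are placeholders the reader would fill with their own measurements. With only a handful of datasets the estimate is noisy, so a significance test (such as the one `scipy.stats.spearmanr` returns) is worth reporting alongside it.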

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.