FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Remote Sensing AI · Depth: Expert, extended

Summary

FUSAR-GPT is a Visual Language Model (VLM) built to address the limitations of general-purpose VLMs when interpreting Synthetic Aperture Radar (SAR) imagery. Developed at Fudan University, it targets three problems: the modality gap between SAR and optical imagery, neglected geospatial priors, and the sparsity of information in SAR targets. The model brings in a geospatial baseline model as "world knowledge" and embeds multi-source remote-sensing temporal features through "spatiotemporal anchors", which dynamically compensate for sparse SAR target representations. It also uses a two-stage Supervised Fine-Tuning (SFT) strategy that decouples knowledge injection from task execution. Built on Qwen2.5-VL-7B, FUSAR-GPT is reported to outperform mainstream baseline models by over 12% across several remote sensing vision-language benchmarks covering target counting, spatial localization, classification, and detection.

Key takeaway

For machine learning engineers building Vision Language Models for Synthetic Aperture Radar (SAR) applications, applying a general VLM directly is likely to give suboptimal results. Integrate multi-source geospatial priors, such as AlphaEarth Foundations embeddings, to compensate for the sparsity of SAR target information. Use a two-stage supervised fine-tuning strategy that first injects domain knowledge and then activates task-specific reasoning; the summarized results credit this approach with gains on remote sensing tasks such as target detection and classification.
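The two-stage idea can be sketched as a training schedule in which each stage declares its data and the parameter groups it is allowed to update. This is a minimal illustration only: the group names, dataset labels, and the choice of which groups are trainable in each stage are assumptions made for the example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str             # hypothetical dataset label
    trainable: frozenset  # parameter groups updated in this stage


# Hypothetical parameter groups of a SAR vision-language model.
GROUPS = ("vision_encoder", "prior_fusion", "projector", "language_model")

# Assumed split: stage 1 injects domain knowledge through the fusion path,
# stage 2 trains the language model on task instructions.
STAGES = (
    Stage("knowledge_injection", "sar_description_pairs",
          frozenset({"prior_fusion", "projector"})),
    Stage("task_activation", "task_instructions",
          frozenset({"projector", "language_model"})),
)


def trainable_mask(stage):
    """Map every parameter group to True (updated) or False (frozen)."""
    return {group: group in stage.trainable for group in GROUPS}


def run_schedule(stages, train_fn):
    """Run the stages in order and record what each one was allowed to update."""
    log = []
    for stage in stages:
        mask = trainable_mask(stage)
        train_fn(stage, mask)  # a real loop would set requires_grad from mask
        log.append((stage.name, sorted(g for g, on in mask.items() if on)))
    return log


def dummy_train(stage, mask):
    """Placeholder for an actual fine-tuning loop."""


schedule_log = run_schedule(STAGES, dummy_train)
for name, groups in schedule_log:
    print(name, groups)
```

Keeping the stages as data makes the decoupling explicit: changing what is frozen in one stage does not touch the other, and the same loop runs both.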

Key insights

FUSAR-GPT improves VLM performance on SAR imagery in two ways: it integrates geospatial priors to offset sparse target information, and it decouples knowledge injection from task training.

Method

FUSAR-GPT extracts 64-dimensional AlphaEarth Foundations (AEF) embeddings, aligns them with SAR images via spatiotemporal anchors, and fuses them with the visual features using a Token-wise Linear Modulation (TLM) module. A two-stage SFT strategy then optimizes cross-modal alignment and task reasoning in separate stages.
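As a rough illustration of token-wise modulation, the sketch below scales and shifts each visual token using parameters predicted from the 64-dimensional prior vector aligned to that token. The exact TLM formulation is not given here, so this assumes a FiLM-style form, `(1 + gamma) * token + beta`; the function names, toy dimensions, and random initialization are all hypothetical.

```python
import random


def linear(vec, weight, bias):
    """Compute W @ vec + b for one vector, using plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def token_wise_modulation(tokens, priors, w_gamma, b_gamma, w_beta, b_beta):
    """Assumed FiLM-style fusion: each token gets its own scale and shift,
    both predicted from the prior vector aligned to that token."""
    fused = []
    for token, prior in zip(tokens, priors):
        gamma = linear(prior, w_gamma, b_gamma)  # per-channel scale offset
        beta = linear(prior, w_beta, b_beta)     # per-channel shift
        fused.append([(1.0 + g) * t + b
                      for t, g, b in zip(token, gamma, beta)])
    return fused


def init_matrix(rows, cols, scale, rng):
    """Random matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


PRIOR_DIM = 64  # AEF embedding size stated in the text
TOKEN_DIM = 8   # toy visual-token width (assumption)
NUM_TOKENS = 4  # toy token count (assumption)

rng = random.Random(0)
tokens = init_matrix(NUM_TOKENS, TOKEN_DIM, 1.0, rng)
priors = init_matrix(NUM_TOKENS, PRIOR_DIM, 1.0, rng)

# With zero projections the module is an identity map over the tokens.
zero_w = [[0.0] * PRIOR_DIM for _ in range(TOKEN_DIM)]
zero_b = [0.0] * TOKEN_DIM
identity_out = token_wise_modulation(tokens, priors,
                                     zero_w, zero_b, zero_w, zero_b)

# Small random projections let the priors modulate every token.
w_gamma = init_matrix(TOKEN_DIM, PRIOR_DIM, 0.05, rng)
w_beta = init_matrix(TOKEN_DIM, PRIOR_DIM, 0.05, rng)
fused = token_wise_modulation(tokens, priors,
                              w_gamma, zero_b, w_beta, zero_b)
```

The `1 + gamma` form is a common design choice for this kind of module: with zero-initialized projections the fused tokens equal the original tokens, so adding the priors does not disturb a pretrained backbone at the start of fine-tuning.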

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.