Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

RSVG-ZeroOV is a novel training-free framework designed for open-vocabulary visual grounding in remote sensing images and videos. It addresses the limitations of existing methods that require expensive manual annotations and struggle with diverse, novel queries. The framework utilizes frozen generic foundation models, specifically vision-language models (VLMs) and diffusion models (DMs), through an "Overview-Focus-Evolve" paradigm. The Overview stage uses VLMs for semantic correlation via cross-attention maps, while Focus employs DMs to refine object structure and shape. Evolve then purifies object masks by suppressing irrelevant activations. For video inputs, Video RSVG-ZeroOV extends this capability to spatio-temporal grounding using a query-relevant key-frame selector and a temporal propagator, achieving efficient and coherent results without video annotations or fine-tuning. Experiments on six image and video grounding benchmarks demonstrate that RSVG-ZeroOV consistently surpasses existing zero-shot baselines and achieves competitive or superior performance against weakly- and fully-supervised approaches.

Key takeaway

If you build remote sensing applications and need to localize objects in images or videos without costly manual annotations, RSVG-ZeroOV offers a training-free option. On the reported benchmarks it is competitive with or better than weakly- and fully-supervised methods, including for open-vocabulary queries that involve novel objects or complex relationships. Consider this VLM-DM fusion approach when you want to reduce annotation burden and improve generalization across diverse geospatial scenarios.

Key insights

Training-free RSVG-ZeroOV combines VLMs and DMs for precise open-vocabulary visual grounding in remote sensing data.

Principles

Combine VLM attention with DM priors.
Progressive refinement improves grounding.
Zero-shot methods can match or outperform supervised ones.

Method

RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm: a VLM supplies cross-attention maps, a DM refines object structure and shape, and an attention evolution module purifies the masks. The video extension adds query-relevant key-frame selection and temporal propagation.
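
To make the three stages concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the function names, the affinity-weighted averaging used for Focus, and the fixed threshold used for Evolve are all illustrative assumptions, and the inputs are hand-written numbers rather than outputs of real models.

```python
# Toy Overview -> Focus -> Evolve pass over four image patches.
# Every name and operation here is a hypothetical stand-in, not the paper's code.

def normalize(values):
    """Min-max scale a list of scores into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def overview(cross_attention):
    """Overview: use text-to-patch attention (stand-in for a VLM) as a coarse relevance map."""
    return normalize(cross_attention)

def focus(coarse, affinity):
    """Focus: average relevance over a patch-to-patch affinity matrix
    (assumed stand-in for the structural prior a DM provides)."""
    refined = []
    for row in affinity:
        total = sum(row)
        weighted = sum(w * c for w, c in zip(row, coarse))
        refined.append(weighted / total if total else 0.0)
    return normalize(refined)

def evolve(refined, threshold=0.5):
    """Evolve: suppress weak activations and binarize (assumed simple threshold)."""
    return [1 if v >= threshold else 0 for v in refined]

# Patches 0 and 1 belong to the queried object; 2 and 3 are background.
cross_attention = [0.9, 0.2, 0.1, 0.3]   # the coarse map only fires on patch 0
affinity = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

coarse = overview(cross_attention)
mask_without_focus = evolve(coarse)
mask = evolve(focus(coarse, affinity))
print(mask_without_focus)  # [1, 0, 0, 0]
print(mask)                # [1, 1, 0, 0]
```

In this toy case, thresholding the coarse map alone misses patch 1, while the affinity step recovers it because it is strongly tied to patch 0. That is the intuition behind combining semantic attention with a structural prior.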

In practice

Apply RSVG-ZeroOV for zero-shot object localization.
Use VLM/DM fusion for fine-grained grounding.
Extend image grounding to video spatio-temporal tasks.
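
The video extension described above can be pictured with a small sketch. It assumes per-frame query-relevance scores are already available and uses nearest-key-frame copying as a deliberately naive stand-in for temporal propagation. The function names, the scores, and the masks are hypothetical and are not taken from the paper.

```python
# Toy key-frame selection and temporal propagation.
# All names and the nearest-key-frame rule are illustrative assumptions.

def select_key_frames(relevance, k):
    """Pick the k frames with the highest query-relevance scores, in temporal order."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])

def propagate(key_masks, num_frames):
    """Assign each frame the mask of its nearest key frame (ties go to the earlier one)."""
    keys = sorted(key_masks)
    return [key_masks[min(keys, key=lambda kf: abs(kf - t))] for t in range(num_frames)]

relevance = [0.1, 0.7, 0.2, 0.3, 0.9, 0.4]   # hypothetical per-frame scores
key_frames = select_key_frames(relevance, k=2)

# Pretend the image-level grounder was run only on the key frames.
key_masks = {key_frames[0]: [1, 0], key_frames[1]: [0, 1]}
video_masks = propagate(key_masks, num_frames=len(relevance))

print(key_frames)   # [1, 4]
print(video_masks)  # [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
```

The point of the design is cost: the expensive image-level grounding runs on a few query-relevant frames, and the remaining frames inherit masks through a cheaper propagation step.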

Topics

Remote Sensing Visual Grounding
Zero-shot Learning
Open-Vocabulary Detection
Vision-Language Models
Diffusion Models

Best for: AI Scientist, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.