Embed the world: Multimodal AI for searchable aerial imagery at scale

2026-06-22 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Advanced, extended

Summary

A multimodal AI system for natural-language search over aerial imagery, developed by the AWS Generative AI Innovation Center (GenAIIC) and Vexcel, addresses the cost of manually inspecting geospatial imagery or training a bespoke computer vision model for each feature of interest. It combines multimodal embeddings, large language model (LLM) captioning, and vector search on AWS, specifically Amazon Bedrock and Amazon OpenSearch Serverless. The team evaluated approximately 100 distinct configurations against OpenStreetMap ground truth for Grant Park, Chicago, to identify the key performance drivers. Amazon Nova Multimodal Embeddings delivered the highest F1 scores (0.621 for swimming pools, 0.555 for roads), and adding captions improved F1 scores by 11-13%. The best fusion and search strategies depended on the feature type. The work has since evolved into Vexcel Intelligence, a searchable imagery product.

Key takeaway

For AI engineers building geospatial semantic search systems: pair foundation model (FM)-generated captions with image embeddings, which delivered an 11-13% F1 score improvement. Default to Amazon Nova Multimodal Embeddings, and build an automated evaluation framework early so that a sweep of ~100 configurations stays practical. Match fusion and search strategies to the feature type, since no single approach dominated all queries, and avoid text-only search.

Key insights

Multimodal AI, combining image embeddings and LLM captions, enables efficient natural-language search over complex aerial imagery.

Principles

Embedding model choice significantly impacts geospatial search quality.
LLM-generated caption integration is a high-ROI optimization.
Optimal fusion and search strategies are feature-type dependent.
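
One way to read the third principle is as a per-feature fusion weight. The sketch below is a minimal illustration, assuming weighted late fusion of two cosine similarities (query vs. image embedding, query vs. caption embedding). The source does not say which fusion methods were tested, so the technique, the function names, and the weights in FUSION_WEIGHTS are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_score(query_vec, image_vec, caption_vec, image_weight):
    """Weighted late fusion: blend image similarity and caption similarity."""
    image_sim = cosine(query_vec, image_vec)
    caption_sim = cosine(query_vec, caption_vec)
    return image_weight * image_sim + (1.0 - image_weight) * caption_sim


# Hypothetical weights, chosen only to show that they can differ per feature type.
FUSION_WEIGHTS = {"swimming_pool": 0.7, "road": 0.4}


def rank_tiles(query_vec, tiles, feature_type, default_weight=0.5):
    """Return tile ids ordered from best to worst fused score."""
    weight = FUSION_WEIGHTS.get(feature_type, default_weight)
    scored = [
        (fused_score(query_vec, t["image_vec"], t["caption_vec"], weight), t["id"])
        for t in tiles
    ]
    return [tile_id for _, tile_id in sorted(scored, reverse=True)]
```

With a table like this, tuning fusion per feature type becomes a matter of sweeping image_weight for each query class and keeping the value with the best F1.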

Method

A five-stage pipeline (Explore AOI, Ingest Imagery, Embed & Index, Search, Evaluate) uses Amazon Bedrock for embeddings/captions and Amazon OpenSearch Serverless for indexing, allowing modular experimentation across ~100 configurations.
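
The modular experimentation described above can be pictured as a grid sweep over the Embed & Index, Search, and Evaluate stages. The following is a rough sketch under stated assumptions: the four configuration axes, the SearchConfig type, and the evaluate callback are illustrative placeholders, not the structure used by the original project, and no AWS calls are made.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class SearchConfig:
    """One point in the experiment grid (all field values are hypothetical labels)."""
    embedding_model: str
    use_captions: bool
    fusion: str
    search: str


def build_grid(models: Iterable[str], caption_flags: Iterable[bool],
               fusions: Iterable[str], searches: Iterable[str]) -> List[SearchConfig]:
    """Enumerate every combination of the four axes."""
    return [
        SearchConfig(m, c, f, s)
        for m, c, f, s in product(models, caption_flags, fusions, searches)
    ]


def run_sweep(configs: Iterable[SearchConfig],
              evaluate: Callable[[SearchConfig], float]) -> List[Tuple[float, SearchConfig]]:
    """Score each configuration with the caller's evaluate function, best first."""
    scored = [(evaluate(cfg), cfg) for cfg in configs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In a real run, evaluate would index the imagery under the given configuration, execute the test queries, and return an F1 score against ground truth. Keeping that step behind a single callback is what lets a grid of this size run unattended.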

In practice

Default to Amazon Nova Multimodal Embeddings for geospatial search.
Integrate FM-generated captions for an 11-13% F1 score improvement.
Skip elevation data (digital surface and terrain models, DSM/DTM) for standard object detection tasks.
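
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it for a single query, assuming results and ground truth are both expressed as sets of tile ids. That tile-level matching is an assumption made for illustration, since the digest does not describe how search results were matched to OpenStreetMap features.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query.

    retrieved: ids returned by the search (assumed here to be tile ids)
    relevant:  ids that contain the feature according to ground truth
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, returning four tiles of which two are among three relevant tiles gives a precision of 1/2, a recall of 2/3, and an F1 of 4/7, roughly 0.571.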

Topics

Multimodal AI
Geospatial Imagery Search
Vector Embeddings
LLM Captioning
Amazon Bedrock
Amazon OpenSearch Serverless
Vexcel Intelligence

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.