Embed the world: Multimodal AI for searchable aerial imagery at scale
Summary
A multimodal AI system developed by the AWS Generative AI Innovation Center (GenAIIC) and Vexcel makes aerial imagery searchable in natural language, removing the need to inspect imagery manually or train a bespoke computer vision model for each geospatial feature. The system combines multimodal embeddings, large language model (LLM) captioning, and vector search on AWS, specifically Amazon Bedrock and Amazon OpenSearch Serverless. The team evaluated approximately 100 distinct configurations against OpenStreetMap ground truth for Grant Park, Chicago, to identify the main performance drivers. Amazon Nova Multimodal Embeddings delivered the highest F1 scores (0.621 for swimming pools, 0.555 for roads), and adding captions improved F1 scores by 11-13%. The best fusion and search strategies depended on the feature type being searched for. This work has since evolved into Vexcel Intelligence, a searchable imagery product.
Key takeaway
For AI engineers building geospatial semantic search systems: integrate foundation model (FM)-generated captions alongside image embeddings, which improved F1 scores by 11-13%. Default to Amazon Nova Multimodal Embeddings, and build an automated evaluation framework early so that many configurations (the project tested roughly 100) can be compared efficiently. Match fusion and search strategies to the feature type, since no single approach dominates all queries, and do not rely on text-only search.
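To make the "automated evaluation framework" advice concrete, here is a minimal sketch of a configuration sweep scored with F1 per feature type. Everything named here is an assumption for illustration: the grid option names, the tile-ID ground truth, and `toy_search` are placeholders, not the project's actual settings or search code.

```python
from itertools import product


def f1_score(retrieved, relevant):
    """F1 between a set of retrieved tile IDs and ground-truth tile IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    if tp == 0:
        return 0.0
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    return 2 * precision * recall / (precision + recall)


def evaluate_grid(search_fn, grid, ground_truth):
    """Run search_fn for every configuration and feature type; record F1."""
    keys = list(grid)
    results = []
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        for feature, relevant in ground_truth.items():
            retrieved = search_fn(feature, config)
            results.append((config, feature, f1_score(retrieved, relevant)))
    return results


def best_per_feature(results):
    """Keep the highest-scoring configuration for each feature type."""
    best = {}
    for config, feature, score in results:
        if feature not in best or score > best[feature][1]:
            best[feature] = (config, score)
    return best


# Hypothetical grid; option names are placeholders, not the project's settings.
grid = {
    "embedding_model": ["model_a", "model_b"],
    "use_captions": [False, True],
    "fusion": ["early", "late"],
}

# Toy ground truth: feature type -> IDs of tiles that contain it.
ground_truth = {"swimming pool": {1, 2, 3}, "road": {2, 3, 4, 5}}


def toy_search(feature, config):
    """Stand-in for a real search call; returns a set of tile IDs."""
    hits = {1, 2, 9} if feature == "swimming pool" else {2, 3, 4}
    if config["use_captions"]:
        hits = hits - {9}  # pretend captions remove one false positive
    return hits


results = evaluate_grid(toy_search, grid, ground_truth)
winners = best_per_feature(results)
for feature, (config, score) in winners.items():
    print(feature, round(score, 3), config)
```

Reporting the winner per feature type, rather than one global winner, reflects the finding that no single configuration dominates all queries.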
Key insights
Multimodal AI, combining image embeddings and LLM captions, enables efficient natural-language search over complex aerial imagery.
Principles
- Embedding model choice significantly impacts geospatial search quality.
- LLM-generated caption integration is a high-ROI optimization.
- Optimal fusion and search strategies are feature-type dependent.
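The summary does not say which fusion strategies were compared, so the following is only a sketch of three common ways to combine an image-embedding signal with a caption-embedding signal: blending vectors before indexing (early fusion), blending similarity scores after two searches (late fusion), and reciprocal rank fusion. The function names, the `alpha` weight, and the `k=60` constant are illustrative assumptions, and early fusion as written assumes both vectors share the same dimension.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def early_fusion(image_vec, caption_vec, alpha=0.5):
    """Blend two same-dimension embeddings into one vector before indexing."""
    blended = alpha * normalize(image_vec) + (1 - alpha) * normalize(caption_vec)
    return normalize(blended)


def late_fusion(image_scores, caption_scores, alpha=0.5):
    """Blend per-tile similarity scores from two separate searches.

    Tiles missing from one result list contribute a score of 0 for it.
    """
    tiles = set(image_scores) | set(caption_scores)
    return {
        t: alpha * image_scores.get(t, 0.0) + (1 - alpha) * caption_scores.get(t, 0.0)
        for t in tiles
    }


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of tile IDs using only rank positions."""
    scores = {}
    for ranking in rankings:
        for rank, tile in enumerate(ranking, start=1):
            scores[tile] = scores.get(tile, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


fused_vec = early_fusion([1.0, 0.0], [0.0, 1.0])
fused_scores = late_fusion({"t1": 0.8, "t2": 0.4}, {"t2": 0.9, "t3": 0.6})
merged = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
print(fused_vec, fused_scores, merged)
```

Because the best strategy varied by feature type, the blend weight and the choice among these functions are worth treating as tunable per query class rather than fixed globally.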
Method
A five-stage pipeline (Explore AOI, Ingest Imagery, Embed & Index, Search, Evaluate) uses Amazon Bedrock for embeddings/captions and Amazon OpenSearch Serverless for indexing, allowing modular experimentation across ~100 configurations.
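As a rough illustration of the Embed & Index and Search stages, the sketch below builds the request bodies an OpenSearch k-NN index typically expects: a mapping with a `knn_vector` field and a `knn` query. The field names (`tile_id`, `caption`, `embedding`) and the helper functions are hypothetical, the vector dimension must match whichever embedding model is used, and the Bedrock embedding call and the OpenSearch client call are deliberately left out. A Serverless collection may need settings beyond what is shown.

```python
import json


def build_index_body(dimension):
    """Index settings and mappings for tiles with a caption and an embedding."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "tile_id": {"type": "keyword"},
                "caption": {"type": "text"},
                # dimension must equal the embedding model's output size
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def build_knn_query(query_vector, k=10):
    """Vector search: the k tiles nearest to the query embedding."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": list(query_vector), "k": k}}},
        "_source": ["tile_id", "caption"],
    }


def build_caption_query(text, k=10):
    """Text-only search over captions, useful as a comparison baseline."""
    return {
        "size": k,
        "query": {"match": {"caption": text}},
        "_source": ["tile_id", "caption"],
    }


index_body = build_index_body(dimension=3)  # toy dimension for the example
knn_query = build_knn_query([0.1, 0.2, 0.3], k=5)
caption_query = build_caption_query("swimming pool", k=5)
print(json.dumps(knn_query, indent=2))
```

Keeping the query builders separate from the client call makes it easier to swap search strategies during experimentation without touching the indexing code.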
In practice
- Default to Amazon Nova Multimodal Embeddings for geospatial search.
- Integrate FM-generated captions for an 11-13% F1 score improvement.
- Skip elevation data (DSM/DTM) for standard object detection tasks.
Topics
- Multimodal AI
- Geospatial Imagery Search
- Vector Embeddings
- LLM Captioning
- Amazon Bedrock
- Amazon OpenSearch Serverless
- Vexcel Intelligence
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.