UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
Summary
UHR-BAT is a novel budget-aware token compression framework designed for ultra-high-resolution (UHR) remote sensing imagery, which often contains kilometer-scale context alongside query-critical evidence occupying only a few pixels. Traditional methods like downsampling or dense tiling either sacrifice detail or incur prohibitive computational costs. UHR-BAT addresses this by employing a query-guided, multi-scale importance estimation for visual tokens, coupled with region-wise preserve and merge strategies to mitigate redundancy and ensure region faithfulness. This approach efficiently selects visual tokens under strict context budgets, enabling the processing of UHR imagery on resource-constrained platforms. Experimental results on XLRS-Bench, RSHR-Bench, and MME-RealWorld-RS demonstrate that UHR-BAT achieves state-of-the-art performance, outperforming existing remote-sensing and general-purpose MLLMs, including GPT-4o and Claude 3.7 Sonnet, with significantly reduced token budgets and improved inference latency.
Key takeaway
For Computer Vision Engineers developing MLLMs for remote sensing, UHR-BAT offers a robust solution to the challenge of processing ultra-high-resolution imagery under strict computational budgets. You should consider integrating its query-guided, region-faithful token compression to maintain critical fine-grained details while significantly reducing inference latency and memory footprint, making deployment on edge devices more feasible.
Key insights
UHR-BAT efficiently processes ultra-high-resolution remote sensing imagery by intelligently compressing visual tokens based on query relevance and regional faithfulness.
Principles
- Compression must be query-guided.
- Compression must be region-faithful.
Method
UHR-BAT uses text-guided, multi-scale importance estimation for visual tokens, followed by region-wise preserve and merge strategies to reduce redundancy and enforce a strict token budget.
In practice
- Utilize multi-scale token streams with scale-specific positional embeddings.
- Implement region-wise preserve-and-merge for efficient token selection.
Topics
- Ultra-High-Resolution Remote Sensing
- Vision-Language Models
- Token Compression
- Query-Guided Token Selection
- Region-wise Preserve-and-Merge
Code references
Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.