UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Environmental Science & Earth Systems · Depth: Expert, extended

Summary

UHR-BAT is a budget-aware token compression framework for ultra-high-resolution (UHR) remote sensing imagery, where a single image can span kilometer-scale context while the evidence a query depends on occupies only a few pixels. The two usual workarounds both fall short: downsampling sacrifices fine detail, and dense tiling incurs prohibitive computational cost. UHR-BAT instead estimates the importance of visual tokens at multiple scales under guidance from the query, then applies region-wise preserve and merge strategies to reduce redundancy while keeping each region faithfully represented. This lets it select visual tokens under a strict context budget, so UHR imagery can be processed on resource-constrained platforms. On XLRS-Bench, RSHR-Bench, and MME-RealWorld-RS, UHR-BAT is reported to achieve state-of-the-art performance, outperforming existing remote-sensing and general-purpose MLLMs, including GPT-4o and Claude 3.7 Sonnet, while using significantly smaller token budgets and lowering inference latency.

Key takeaway

For computer vision engineers building MLLMs for remote sensing, UHR-BAT addresses the problem of processing ultra-high-resolution imagery under a strict computational budget. Consider integrating its query-guided, region-faithful token compression to preserve critical fine-grained detail while significantly reducing inference latency and memory footprint, which makes deployment on edge devices more feasible.

Key insights

UHR-BAT processes ultra-high-resolution remote sensing imagery efficiently by compressing visual tokens according to query relevance and region faithfulness.

Method

UHR-BAT uses text-guided, multi-scale importance estimation for visual tokens, followed by region-wise preserve and merge strategies to reduce redundancy and enforce a strict token budget.
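
The preserve-and-merge idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `compress_tokens`, the single-scale cosine-similarity score (standing in for the multi-scale importance estimate), the fixed region labels, and the one-merged-token-per-region rule are all assumptions made for illustration. It preserves the highest-scoring tokens unchanged, averages the remaining tokens of each region into one summary token, and never returns more than `budget` tokens.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compress_tokens(tokens, region_ids, query, budget):
    """Hypothetical helper: return at most `budget` token vectors.

    tokens     : list of feature vectors, one per visual token
    region_ids : region label for each token (assumed to be given)
    query      : query embedding living in the same space as the tokens
    budget     : maximum number of tokens to return
    """
    # Group token indices by region.
    regions = defaultdict(list)
    for index, region in enumerate(region_ids):
        regions[region].append(index)
    if budget < len(regions):
        raise ValueError("budget must allow at least one token per region")

    # Assumption: importance is plain cosine similarity to the query.
    scores = [cosine(token, query) for token in tokens]

    # Reserve one slot per region for a merged token; the spare slots
    # go to the highest-scoring tokens overall, which are kept verbatim.
    spare = budget - len(regions)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    preserved = set(ranked[:spare])

    output = []
    for region in sorted(regions):
        members = regions[region]
        kept = [i for i in members if i in preserved]
        rest = [i for i in members if i not in preserved]
        output.extend(tokens[i] for i in kept)
        if rest:
            # Merge the unpreserved tokens so the region is still represented.
            output.append(mean_vector([tokens[i] for i in rest]))
    return output


tokens = [[1, 0], [0, 1], [1, 1], [0, 2]]
compressed = compress_tokens(tokens, region_ids=[0, 0, 1, 1], query=[1, 0], budget=3)
print(len(compressed))
```

Reserving one slot per region is one simple way to guarantee that no region disappears from the compressed sequence, at the cost of leaving fewer slots for verbatim tokens when there are many regions.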

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.