HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
Summary
The Hierarchical Decoupling Framework (HiDe) addresses the suboptimal performance of Multimodal Large Language Models (MLLMs) on high-resolution images. It identifies complex background interference, not small object size, as the primary limitation. HiDe is training-free and works in two stages. Token-wise Attention Decoupling (TAD) identifies key information tokens and aligns them with their target visual regions. Layout-Preserving Decoupling (LPD) then removes background interference and reconstructs a compact representation that preserves spatial layout. HiDe sets new state-of-the-art results on V*Bench, HRBench4K, and HRBench8K, raising Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, respectively, and surpassing RL-based methods. It also cuts memory usage by a reported 75% (from 96 GB to 20 GB) compared to prior training-free methods.
Key takeaway
For machine learning engineers optimizing Multimodal Large Language Models for high-resolution image tasks, HiDe is a training-free option worth evaluating. It improves accuracy on fine-grained visual understanding benchmarks such as V*Bench, HRBench4K, and HRBench8K without retraining the base model. Its reported 75% memory reduction (from 96 GB to 20 GB) relative to prior training-free methods also lowers the hardware needed for deployment.
Key insights
MLLMs struggle on high-resolution images mainly because of background interference, not because target objects are small. HiDe targets that interference directly.
Principles
- Cropping, not upscaling, improves MLLM high-resolution performance.
- Background semantics and redundant tokens distract MLLMs.
- Preserving object spatial layout is crucial for MLLM reasoning.
Method
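HiDe first applies Token-wise Attention Decoupling (TAD) to identify key information tokens and obtain cleaner attention maps that align each token with its target visual region. It then applies Layout-Preserving Decoupling (LPD) to extract those regions, discard the background, and reconstruct a compact image in which the regions keep their relative spatial layout.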
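To make the idea of per-token attention maps concrete, here is a minimal sketch, not the paper's procedure. It assumes a hypothetical array `attn` of shape (tokens, height, width) holding text-token-to-image-patch attention, and a hypothetical list `key_idx` of key token indices. Subtracting the attention shared across all tokens and applying a relative threshold `rel_thresh` are illustrative assumptions standing in for whatever purification TAD actually performs.

```python
import numpy as np

def token_masks(attn, key_idx, rel_thresh=0.5):
    """Return one boolean patch mask per key token (illustrative only).

    attn: array of shape (T, H, W), attention from each text token to each patch.
    key_idx: indices of the tokens treated as key information tokens.
    """
    attn = np.asarray(attn, dtype=float)
    # Assumption: attention common to all tokens is treated as background.
    baseline = attn.mean(axis=0)
    masks = {}
    for i in key_idx:
        # Keep only the attention this token has beyond the shared baseline.
        residual = np.clip(attn[i] - baseline, 0.0, None)
        peak = residual.max()
        if peak == 0:
            masks[i] = np.zeros(residual.shape, dtype=bool)
        else:
            masks[i] = residual >= rel_thresh * peak
    return masks

# Toy input: uniform background plus one distinct peak for tokens 0 and 1.
attn = np.full((3, 4, 4), 0.25)
attn[0, 1, 1] += 1.0
attn[1, 2, 3] += 1.0
masks = token_masks(attn, key_idx=[0, 1, 2])
```

In this toy input, tokens 0 and 1 each yield a single-patch mask at their own peak, while token 2, which only carries background attention, yields an empty mask.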
In practice
- Focus on semantic tokens for precise object localization.
- Eliminate background noise and redundant tokens.
- Reconstruct regions preserving spatial layout.
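The last two steps above can be sketched as follows. This is an assumption-laden illustration, not HiDe's implementation: the function name `compact_regions`, the pixel-level boolean `mask`, and the `fill` value are all hypothetical. The sketch blanks every pixel outside the mask, then drops rows and columns that contain no target pixel, so the surviving regions keep their left/right and above/below ordering.

```python
import numpy as np

def compact_regions(image, mask, fill=0):
    """Remove background and collapse empty rows/columns (illustrative only).

    image: array of shape (H, W) or (H, W, C).
    mask: boolean array of shape (H, W), True where a target region lies.
    """
    image = np.asarray(image)
    mask = np.asarray(mask, dtype=bool)
    if mask.shape != image.shape[:2]:
        raise ValueError("mask must match the image height and width")
    if not mask.any():
        # Nothing was localized: fall back to the unmodified image.
        return image.copy()
    cleaned = image.copy()
    cleaned[~mask] = fill          # blank out background pixels
    rows = mask.any(axis=1)        # rows that touch at least one region
    cols = mask.any(axis=0)        # columns that touch at least one region
    return cleaned[rows][:, cols]  # relative order of regions is unchanged

# Toy input: two regions in opposite corners of a 6x6 image.
image = np.arange(1, 37).reshape(6, 6)
mask = np.zeros((6, 6), dtype=bool)
mask[0:2, 0:2] = True
mask[4:6, 3:6] = True
out = compact_regions(image, mask)
```

Here the 6x6 input shrinks to 4x5: the top-left region stays top-left, the bottom-right region stays bottom-right, and the cells between them are filled with `fill`.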
Topics
- Multimodal LLMs
- High-Resolution Vision
- Visual Question Answering
- Attention Decoupling
- Training-Free Methods
- Memory Efficiency
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.