HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

2025-02-02 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

The Hierarchical Decoupling Framework (HiDe) targets the suboptimal performance of Multimodal Large Language Models (MLLMs) on high-resolution images. It identifies complex background interference, rather than small object size, as the primary limitation. HiDe is training-free: Token-wise Attention Decoupling (TAD) identifies key information tokens and aligns them with target visual regions, and Layout-Preserving Decoupling (LPD) then removes background interference and reconstructs a compact representation that preserves spatial layout. HiDe sets new state-of-the-art results on V*Bench, HRBench4K, and HRBench8K, raising Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, respectively, and surpassing RL-based methods. It also reportedly uses 75% less memory than prior training-free methods; the cited figures, 96 GB down to 20 GB, work out to roughly 79%.

Key takeaway

For machine learning engineers optimizing Multimodal Large Language Models for high-resolution image tasks, HiDe is a training-free option worth evaluating. It improves accuracy on fine-grained visual understanding benchmarks such as V*Bench, HRBench4K, and HRBench8K without any retraining. Its reported memory footprint (20 GB, versus 96 GB for prior training-free methods) also makes it practical to deploy without extensive hardware upgrades.

Key insights

MLLMs struggle with high-resolution images mainly because of background interference rather than small object size, and HiDe targets that interference directly.

Principles

Cropping, not upscaling, improves MLLM high-resolution performance.
Background semantics and redundant tokens distract MLLMs.
Preserving object spatial layout is crucial for MLLM reasoning.

Method

HiDe employs Token-wise Attention Decoupling (TAD) to purify attention maps and Layout-Preserving Decoupling (LPD) to extract and reconstruct target regions, preserving spatial layout.
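
This summary does not spell out how TAD purifies an attention map, so the following is only a minimal Python sketch of the general idea under an assumed rule: subtract the attention pattern shared by all question tokens from the key token's map, then threshold what remains. The function names `purify_attention` and `threshold_mask`, the mean-subtraction rule, and the `ratio` parameter are all hypothetical and are not taken from the paper or its code.

```python
from typing import List

Grid = List[List[float]]


def purify_attention(token_maps: List[Grid], key_index: int) -> Grid:
    """Assumed purification rule: subtract the per-patch mean over all
    token maps from the key token's map and clip negatives to zero."""
    n = len(token_maps)
    rows, cols = len(token_maps[0]), len(token_maps[0][0])
    purified = []
    for r in range(rows):
        row = []
        for c in range(cols):
            shared = sum(m[r][c] for m in token_maps) / n
            row.append(max(token_maps[key_index][r][c] - shared, 0.0))
        purified.append(row)
    return purified


def threshold_mask(grid: Grid, ratio: float = 0.5) -> List[List[bool]]:
    """Mark patches whose score is at least `ratio` times the peak score."""
    peak = max(max(row) for row in grid)
    if peak <= 0:
        return [[False for _ in row] for row in grid]
    return [[value >= ratio * peak for value in row] for row in grid]


# Toy 2x3 patch grid. Every token attends to the top-left patch (shared
# background response); only the key token (index 2) also attends to the
# bottom-right patch.
token_maps = [
    [[0.5, 0.1, 0.1], [0.1, 0.1, 0.1]],
    [[0.5, 0.1, 0.1], [0.1, 0.1, 0.1]],
    [[0.5, 0.1, 0.1], [0.1, 0.1, 0.7]],
]
mask = threshold_mask(purify_attention(token_maps, key_index=2))
print(mask)
```

In this toy case the shared top-left response cancels out and only the bottom-right patch survives the threshold, which is the kind of background suppression the method description points to.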

In practice

Focus on semantic tokens for precise object localization.
Eliminate background noise and redundant tokens.
Reconstruct regions preserving spatial layout.
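
The last two steps above can be sketched as follows. This is an illustrative Python example, not the authors' LPD implementation: it assumes a boolean patch mask is already available, drops every grid row and column that contains no selected patch, and blanks unselected patches that remain. The names `kept_rows_and_cols`, `compact_layout`, and the `fill` placeholder are hypothetical.

```python
from typing import List, Optional, Tuple


def kept_rows_and_cols(mask: List[List[bool]]) -> Tuple[List[int], List[int]]:
    """Indices of rows and columns that contain at least one selected patch."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows, cols


def compact_layout(
    patches: List[List[str]],
    mask: List[List[bool]],
    fill: Optional[str] = None,
) -> List[List[Optional[str]]]:
    """Drop empty rows and columns, keeping the surviving patches in their
    original relative order; unselected patches are replaced by `fill`."""
    rows, cols = kept_rows_and_cols(mask)
    return [[patches[r][c] if mask[r][c] else fill for c in cols] for r in rows]


# Toy 4x5 patch grid with one target in the top-left corner and another
# spanning two patches in the bottom-right corner.
patches = [[f"r{r}c{c}" for c in range(5)] for r in range(4)]
mask = [[False] * 5 for _ in range(4)]
mask[0][0] = True
mask[3][3] = True
mask[3][4] = True

compact = compact_layout(patches, mask)
for row in compact:
    print(row)
```

Here the 20-patch grid shrinks to a 2x3 grid, and the top-left target still sits above and to the left of the bottom-right one, which is the layout-preservation property the principles call for.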

Topics

Multimodal LLMs
High-Resolution Vision
Visual Question Answering
Attention Decoupling
Training-Free Methods
Memory Efficiency

Code references

Tennine2077/HiDe

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.