HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

The Hierarchical Decoupling Framework (HiDe) targets the suboptimal performance of Multimodal Large Language Models (MLLMs) on high-resolution images. It identifies complex background interference, rather than small object size, as the primary limitation. HiDe is training-free: Token-wise Attention Decoupling (TAD) identifies the key information tokens and aligns them with their target visual regions, and Layout-Preserving Decoupling (LPD) then removes background interference and reconstructs a compact representation that preserves spatial layout. HiDe sets new state-of-the-art results on V*Bench, HRBench4K, and HRBench8K, raising Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, respectively, which surpasses even RL-based methods. It also uses about 75% less memory than prior training-free methods (a reported drop from 96 GB to 20 GB).

Key takeaway

For machine learning engineers optimizing MLLMs for high-resolution image tasks, HiDe is a training-free option worth evaluating. It improves accuracy on fine-grained visual understanding benchmarks such as V*Bench, HRBench4K, and HRBench8K. Its reported memory reduction of about 75% (from 96 GB to 20 GB, relative to prior training-free methods) also makes it practical to deploy without retraining or major hardware upgrades.

Key insights

MLLMs struggle on high-resolution images mainly because of complex background interference, not small object size. HiDe addresses this by decoupling the target regions from the background.

Method

HiDe employs Token-wise Attention Decoupling (TAD) to purify attention maps and Layout-Preserving Decoupling (LPD) to extract and reconstruct target regions, preserving spatial layout.
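The two stages can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the mean-subtraction used to "purify" attention, the threshold, and the row/column compaction are all assumptions chosen to show the general idea of isolating target regions and dropping background while keeping relative positions.

```python
import numpy as np

def purify_attention(attn, key_ids):
    """Hypothetical stand-in for TAD.

    attn: (T, H, W) attention from T text tokens to an H x W patch grid.
    key_ids: indices of the tokens assumed to carry the key information.
    Attention shared with the non-key tokens is treated as background
    (an assumption) and subtracted from the key tokens' maps.
    """
    attn = np.asarray(attn, dtype=float)
    key_mask = np.zeros(attn.shape[0], dtype=bool)
    key_mask[list(key_ids)] = True
    background = attn[~key_mask].mean(axis=0) if (~key_mask).any() else 0.0
    purified = np.clip(attn[key_mask] - background, 0.0, None).sum(axis=0)
    peak = purified.max()
    return purified / peak if peak > 0 else purified

def compact_layout(image, saliency, patch, thresh=0.5):
    """Hypothetical stand-in for LPD.

    image: (H * patch, W * patch, C) array; saliency: (H, W) patch scores.
    Patch rows and columns with no salient patch are dropped. The kept
    rows and columns stay in their original order, so left/right and
    above/below relations between target regions are preserved.
    Non-salient patches inside a kept row or column are retained.
    """
    keep = saliency >= thresh
    rows = np.flatnonzero(keep.any(axis=1))
    cols = np.flatnonzero(keep.any(axis=0))
    h, w = saliency.shape
    # View the image as a grid of patch blocks: (H, patch, W, patch, C).
    blocks = image.reshape(h, patch, w, patch, -1)
    kept = blocks[rows][:, :, cols]
    return kept.reshape(len(rows) * patch, len(cols) * patch, -1)

# Toy usage: one key token attends to two opposite corners of a 4 x 4 grid.
attn = np.full((3, 4, 4), 0.1)
attn[0, 0, 0], attn[0, 3, 3] = 1.0, 0.9
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
saliency = purify_attention(attn, key_ids=[0])
compact = compact_layout(image, saliency, patch=2)
```

In this toy case the compacted image keeps only the patch rows and columns that contain the two attended corners, so the input passed on to the model is a quarter of the original area while the top-left region still sits above and to the left of the bottom-right one.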

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.