DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

Source: cs.CV updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision) · Depth: Expert, extended

Summary

DualVision is a lightweight fusion module that makes Multimodal Large Language Models (MLLMs) more robust in visual reasoning by combining RGB and infrared (IR) imagery. MLLMs that rely on RGB input alone often fail under adverse conditions such as fog, blur, or low light. DualVision addresses this with patch-level localized cross-attention that fuses IR and RGB information, which the authors report cuts computational overhead by approximately 75% compared to naive fusion. To support development and evaluation, the researchers also introduce two datasets: DV-204K, with around 25K aligned IR-RGB image pairs and 204K modality-specific QA annotations for instruction tuning, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs for evaluating cross-modal reasoning under various visual degradations. In benchmarks against open- and closed-source MLLMs, DualVision is reported to achieve stronger performance and better robustness.

Key takeaway

Research scientists developing MLLMs for real-world applications such as autonomous driving should consider integrating infrared data with RGB inputs. DualVision's lightweight, multi-scale localized cross-attention fusion module is reported to improve model robustness under common visual degradations, such as fog or low light, while keeping computation efficient. Similar fusion architectures and degradation-aware training protocols are worth exploring to make MLLMs more reliable in challenging environments.

Key insights

Integrating infrared and RGB data via localized cross-attention significantly boosts MLLM robustness in degraded visual conditions.

Principles

Method

DualVision uses multi-scale localized cross-attention to fuse RGB and IR patch tokens, allowing RGB tokens to attend only to spatially corresponding IR regions. This hierarchical approach, with progressively expanding local attention radii, creates a unified IR-RGB representation.
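
The fusion idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the square (Chebyshev) window, the radii (0, 1, 2), the residual addition, and the absence of learned query/key/value projections are all assumptions made to keep the example short. It only shows the core mechanism, in which each RGB patch token attends to IR patch tokens inside a local window that grows from one stage to the next.

```python
import numpy as np

def local_mask(grid_h, grid_w, radius):
    """Boolean (N, N) mask, N = grid_h * grid_w. Entry [i, j] is True when
    IR patch j lies within `radius` patches of RGB patch i (Chebyshev
    distance on the shared patch grid). Window shape is an assumption."""
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)        # (N, 2)
    diff = np.abs(coords[:, None, :] - coords[None, :, :])     # (N, N, 2)
    return diff.max(axis=-1) <= radius

def local_cross_attention(rgb, ir, mask):
    """RGB tokens act as queries, IR tokens as keys and values.
    Learned projections are omitted here for brevity (assumption)."""
    d = rgb.shape[-1]
    scores = rgb @ ir.T / np.sqrt(d)                 # (N, N) similarity
    scores = np.where(mask, scores, -np.inf)         # block out-of-window pairs
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the window
    return rgb + weights @ ir                        # residual fusion (assumption)

def multi_scale_fuse(rgb, ir, grid_h, grid_w, radii=(0, 1, 2)):
    """Apply local cross-attention repeatedly with expanding radii.
    The specific radii are hypothetical, chosen only for illustration."""
    fused = rgb
    for radius in radii:
        fused = local_cross_attention(fused, ir, local_mask(grid_h, grid_w, radius))
    return fused

# Usage: a 4x4 patch grid with 8-dimensional tokens per modality.
rng = np.random.default_rng(0)
rgb_tokens = rng.normal(size=(16, 8))
ir_tokens = rng.normal(size=(16, 8))
fused_tokens = multi_scale_fuse(rgb_tokens, ir_tokens, grid_h=4, grid_w=4)
```

With radius 0, each RGB patch attends only to its co-located IR patch. On this 4x4 grid, a radius of 1 scores 100 of the 256 possible RGB-IR pairs, which shows why restricting attention to local windows is cheaper than attending over all pairs. The toy count is not meant to reproduce the roughly 75% reduction reported for DualVision.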

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.