DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
Summary
DualVision is a lightweight fusion module that makes Multimodal Large Language Models (MLLMs) more robust in visual reasoning tasks by integrating RGB and infrared (IR) imagery. Traditional MLLMs rely primarily on RGB data and often fail under adverse conditions such as fog, blur, or low light. DualVision addresses this with patch-level localized cross-attention that combines IR and RGB information efficiently, reducing computational overhead by approximately 75% compared to naive fusion. To support development and evaluation, the authors also introduce two new datasets: DV-204K, comprising around 25K aligned IR-RGB image pairs with 204K modality-specific QA annotations for instruction tuning, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs for evaluating cross-modal reasoning under various visual degradations. In benchmarks against open- and closed-source MLLMs, DualVision showed stronger empirical performance and improved robustness.
Key takeaway
Research scientists developing MLLMs for real-world applications such as autonomous driving should consider integrating infrared data with RGB inputs. DualVision's lightweight, multi-scale localized cross-attention fusion module is an empirically supported way to improve model robustness under common visual degradations, such as fog or low light, while keeping computation efficient. Similar fusion architectures and degradation-aware training protocols are worth exploring to make MLLMs more reliable in challenging environments.
Key insights
Integrating infrared and RGB data via localized cross-attention significantly boosts MLLM robustness in degraded visual conditions.
Principles
- IR complements RGB for robust perception.
- Localized cross-attention reduces computational overhead.
- Degradation-aware training improves model resilience.
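To make the second principle concrete, the sketch below counts how many RGB-to-IR token pairs are scored under global cross-attention versus a local window. The function name `attended_pairs`, the square window, the 24x24 patch grid, and the radius of 3 are all illustrative assumptions, not details from the paper, and the count is not meant to reproduce the reported 75% figure, which depends on the authors' actual configuration.

```python
def attended_pairs(h, w, radius=None):
    """Count (RGB query, IR key) pairs on an h x w patch grid.

    radius=None means global attention (every query sees every key).
    Otherwise each query sees IR patches within `radius` rows and
    columns of its own position, clipped at the grid border.
    """
    if radius is None:
        return (h * w) ** 2
    total = 0
    for y in range(h):
        for x in range(w):
            rows = min(h - 1, y + radius) - max(0, y - radius) + 1
            cols = min(w - 1, x + radius) - max(0, x - radius) + 1
            total += rows * cols
    return total


# Hypothetical setting: 24x24 patches, local radius 3.
global_cost = attended_pairs(24, 24)
local_cost = attended_pairs(24, 24, radius=3)
print(f"local / global pair ratio: {local_cost / global_cost:.3f}")
```

The ratio shrinks as the grid grows, because the local cost scales with the number of patches times the window area rather than with the number of patches squared.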
Method
DualVision uses multi-scale localized cross-attention to fuse RGB and IR patch tokens, allowing RGB tokens to attend only to spatially corresponding IR regions. This hierarchical approach, with progressively expanding local attention radii, creates a unified IR-RGB representation.
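The idea described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the square neighborhood, the radii `(1, 2, 4)`, the residual update, and the absence of learned query/key/value projections are all choices made here for brevity. The dense mask is used only for readability; an efficient implementation would gather local windows instead of scoring every pair.

```python
import numpy as np


def local_mask(h, w, radius):
    """Boolean (h*w, h*w) mask: True where an IR patch lies within
    `radius` rows and columns of the RGB patch on the h x w grid."""
    ys, xs = np.divmod(np.arange(h * w), w)  # row and column of each patch
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    return np.maximum(dy, dx) <= radius


def local_cross_attention(rgb, ir, h, w, radius):
    """RGB tokens (queries) attend only to nearby IR tokens (keys/values).

    rgb, ir: arrays of shape (h*w, d). Returns (attended, weights).
    """
    d = rgb.shape[1]
    scores = rgb @ ir.T / np.sqrt(d)
    scores = np.where(local_mask(h, w, radius), scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)                     # masked entries become 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ ir, weights


def multi_scale_fusion(rgb, ir, h, w, radii=(1, 2, 4)):
    """Apply local cross-attention with progressively larger radii,
    adding each result back to the running RGB representation."""
    fused = rgb
    for r in radii:
        attended, _ = local_cross_attention(fused, ir, h, w, r)
        fused = fused + attended  # residual update (an assumption)
    return fused


# Usage with random stand-in tokens on a hypothetical 4x4 patch grid.
rng = np.random.default_rng(0)
rgb_tokens = rng.normal(size=(16, 8))
ir_tokens = rng.normal(size=(16, 8))
fused_tokens = multi_scale_fusion(rgb_tokens, ir_tokens, h=4, w=4)
print(fused_tokens.shape)
```

Because every RGB patch always sees at least the IR patch at its own position, each softmax row has a finite maximum and the masking never produces an empty row.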
In practice
- Use IR-RGB fusion for MLLMs in low-visibility scenarios.
- Employ localized cross-attention to optimize fusion compute.
- Train MLLMs with degraded inputs to enhance robustness.
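For the last point, one simple way to produce degraded training inputs is to corrupt the RGB image on the fly while leaving the paired IR image untouched. The sketch below is a hedged illustration: the function names, the parameter defaults, the uniform-transmission fog model, the box blur, and the choice to degrade only the RGB stream are assumptions made here, not the paper's training protocol. Images are assumed to be float arrays in [0, 1] with shape (H, W, 3).

```python
import numpy as np


def low_light(img, gain=0.3, gamma=2.0):
    """Darken the image with a gamma curve followed by a gain."""
    return np.clip(gain * img ** gamma, 0.0, 1.0)


def fog(img, transmission=0.5, airlight=0.9):
    """Uniform haze: blend the scene with a bright airlight color."""
    return np.clip(img * transmission + airlight * (1.0 - transmission), 0.0, 1.0)


def box_blur(img, k=5):
    """Average over a k x k window (k odd), replicating edge pixels."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)


DEGRADATIONS = {"low_light": low_light, "fog": fog, "blur": box_blur}


def degrade_rgb(rgb, rng, p=0.5):
    """With probability p, apply one randomly chosen degradation.

    Returns (image, name), where name is "clean" if nothing was applied.
    """
    if rng.random() >= p:
        return rgb, "clean"
    names = list(DEGRADATIONS)
    name = names[rng.integers(len(names))]
    return DEGRADATIONS[name](rgb), name


# Usage on a random stand-in image; the IR image would pass through unchanged.
rng = np.random.default_rng(0)
rgb_image = rng.random((32, 32, 3))
degraded, applied = degrade_rgb(rgb_image, rng, p=1.0)
print(applied, degraded.shape)
```

Keeping the IR input clean in this sketch reflects the motivation that IR is less affected by conditions that corrupt RGB; whether the IR stream should also be perturbed is a design decision left open here.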
Topics
- DualVision
- RGB-Infrared Fusion
- Multimodal Large Language Models
- Visual Degradation Robustness
- DV-204K Dataset
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.