DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, long

Summary

The Dual-Branch Multimodal Framework (DBMF) is a deep learning system for out-of-distribution (OOD) detection in medical imaging, specifically endoscopic image analysis. It addresses limitations of existing methods by fully leveraging multimodal information through two complementary branches: a text-image branch and a vision branch. The text-image branch is trained with a new text-separation contrastive loss ($L_{TSC}$) that strengthens the textual modality, while the vision branch is trained with a standard cross-entropy loss ($L_{CE}$). After training, the scores from the two branches ($S_{t}$ and $S_{v}$) are integrated into a final OOD score $S$. Experiments on the Kvasir-v2 and GastroVision endoscopic image datasets show that DBMF is robust across backbones such as ResNet18 and DeiT, improving on state-of-the-art OOD detection performance by up to 24.84% in FPR95 and 3.81% in AUROC on the GastroVision dataset.
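The two reported metrics, FPR95 and AUROC, can be computed from per-sample OOD scores. The following is a minimal sketch and not the paper's evaluation code. It assumes that a higher score means "more likely OOD" and that FPR95 is the fraction of OOD samples still accepted as in-distribution (ID) at the threshold that accepts 95% of ID samples. The function names `auroc` and `fpr_at_95_tpr` are illustrative.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample receives a higher OOD score
    than a random ID sample (ties count as 0.5)."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID at the threshold
    that accepts at least 95% of the ID samples."""
    ranked = sorted(id_scores)
    n = len(ranked)
    # Index of the smallest ID score such that at least 95% of ID
    # samples score at or below it: ceil(0.95 * n) - 1, in integers.
    k = (95 * n + 99) // 100 - 1
    threshold = ranked[k]
    # ASSUMPTION: a sample is accepted as ID when its score <= threshold.
    accepted_ood = sum(1 for o in ood_scores if o <= threshold)
    return accepted_ood / len(ood_scores)


if __name__ == "__main__":
    # Toy scores, made up purely for illustration.
    id_scores = list(range(1, 21))
    ood_scores = [15, 25, 30, 40]
    print("AUROC:", auroc(id_scores, ood_scores))
    print("FPR95:", fpr_at_95_tpr(id_scores, ood_scores))
```

Lower FPR95 and higher AUROC both indicate better separation of ID and OOD samples, which is why an improvement in FPR95 corresponds to a reduction in its value.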

Key takeaway

For computer vision engineers building medical imaging diagnostics, DBMF offers a robust way to identify out-of-distribution inputs. Consider adopting its dual-branch multimodal architecture to improve the reliability and generalizability of your deep learning models, particularly in endoscopic image analysis, so that unseen disease cases trigger human review instead of receiving overconfident predictions.

Key insights

DBMF enhances OOD detection in medical imaging by combining text-image and vision branches for robust multimodal analysis.

Principles

Multimodal fusion improves OOD detection.
Complementary branches enhance model robustness.
Text-separation loss optimizes text-image alignment.

Method

DBMF trains a text-image branch with $L_{TSC}$ and a vision branch with $L_{CE}$. It then combines their respective scores, $S_{t}$ and $S_{v}$, into a final OOD score $S$ for threshold-based detection.
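The final scoring step can be sketched as follows. This summary does not state how $S_{t}$ and $S_{v}$ are integrated, so the convex combination below, the weight `alpha`, and the helper names `fuse_scores` and `is_ood` are all assumptions made for illustration, not the paper's formula. The sketch also assumes a higher $S$ means "more likely OOD".

```python
def fuse_scores(s_t, s_v, alpha=0.5):
    """Combine the text-image score s_t and the vision score s_v.

    ASSUMPTION: a convex combination with a hypothetical weight alpha.
    The actual integration rule used by DBMF may differ.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * s_t + (1.0 - alpha) * s_v


def is_ood(s_t, s_v, threshold, alpha=0.5):
    """Threshold-based detection: flag the input as OOD when the fused
    score exceeds the threshold (ASSUMPTION: higher S = more OOD)."""
    return fuse_scores(s_t, s_v, alpha) > threshold


if __name__ == "__main__":
    # Toy branch scores and threshold, made up for illustration.
    print(is_ood(s_t=0.9, s_v=0.7, threshold=0.6))  # fused 0.8 > 0.6
    print(is_ood(s_t=0.2, s_v=0.4, threshold=0.6))  # fused 0.3 <= 0.6
```

In practice the threshold would typically be chosen on held-out in-distribution data, for example at the value that accepts 95% of ID samples, and inputs flagged as OOD would be routed to human review.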

In practice

Apply DBMF to medical image analysis.
Use ResNet18 or DeiT as image encoder backbones.
Consider prompt learning for generating the text inputs.

Topics

DBMF Framework
Out-of-Distribution Detection
Multimodal Deep Learning
Endoscopic Image Analysis
Text-Separation Contrastive Loss

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.