MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

Source: cs.CV updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Biomedical Imaging & Diagnostics · Depth: Expert, extended

Summary

The Massive Multi-modal Biomedical Understanding (MMBU) benchmark is introduced as the largest biomedical vision-language benchmark to date. It covers 35 submodalities across 11 modalities, 20 specimens, and 95 regions of interest, drawn from 410 datasets. The authors evaluate 15 open-weight and 2 frontier Vision-Language Models (VLMs) on three tasks: ungrounded classification, grounded classification, and object detection. Medical adaptation yields modest gains for some models, but high accuracy on established benchmarks often masks significant deficiencies in visual perception and domain generalization. Absolute F1 scores remain low: the best closed-format result is 0.693, and an average gap of 0.26 separates closed-format from open-ended performance. Object detection is a critical failure mode: no VLM surpasses the random baseline (F1 = 0.172) in closed settings.
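To make the reported quantities concrete, the following is a minimal sketch of how a per-task F1 score and a closed-versus-open gap could be computed. It assumes macro-averaged F1 over class labels and a simple per-model mean difference; the summary does not state which F1 variant or averaging scheme the MMBU authors use, and the function names `macro_f1` and `closed_open_gap` are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption: macro averaging. The benchmark's exact F1 variant is not
    specified in this summary.
    """
    labels = sorted(set(y_true))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def closed_open_gap(closed_scores, open_scores):
    """Mean of (closed F1 - open F1) over models present in both dicts.

    Assumption: the gap is averaged per model; this is illustrative only.
    """
    models = closed_scores.keys() & open_scores.keys()
    if not models:
        raise ValueError("no models in common")
    return sum(closed_scores[m] - open_scores[m] for m in models) / len(models)
```

Comparing a model's `macro_f1` against the same metric computed for uniformly random predictions is one way to reproduce the kind of random-baseline check the summary describes for object detection.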

Key takeaway

If you develop biomedical VLMs as an AI scientist or machine learning engineer, assess model robustness beyond established benchmarks. MMBU shows that current models struggle with fine-grained visual perception, especially in open-ended tasks and in object detection, where no evaluated VLM beats the random baseline in closed settings. Focus your development efforts on improving spatial reasoning and on adaptation strategies that generalize consistently across diverse biomedical modalities and contexts, rather than optimizing solely for narrow, potentially contaminated datasets.

Key insights

Biomedical VLMs exhibit pervasive perceptual weaknesses, especially in object detection and open-ended tasks, despite medical adaptation.

Method

MMBU is built with a four-stage, metadata-driven, human-in-the-loop pipeline: task demonstration, fine-grained metadata collection, question template construction, and expert validation of the resulting prompts so that they are clinically grounded.
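The question template construction stage can be illustrated with a small sketch that turns per-image metadata into a closed-format multiple-choice prompt. Everything here is an assumption made for illustration: the `SampleMetadata` fields, the `TEMPLATE` wording, and the `build_closed_question` helper are hypothetical and are not taken from the MMBU pipeline itself.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMetadata:
    """Hypothetical metadata record; field names are illustrative."""
    modality: str
    specimen: str
    region: str
    label: str


# Hypothetical template; the actual MMBU templates are expert-validated.
TEMPLATE = "This is a {modality} image of {specimen} {region}. Which finding is shown?"
LETTERS = "ABCDEFGH"


def build_closed_question(meta, distractors, seed=0):
    """Return (prompt, answer_letter) for a closed-format question."""
    options = [meta.label, *distractors]
    if len(options) > len(LETTERS):
        raise ValueError("too many answer options")
    # Seeded shuffle so the same sample always yields the same prompt.
    random.Random(seed).shuffle(options)
    lines = [TEMPLATE.format(modality=meta.modality,
                             specimen=meta.specimen,
                             region=meta.region)]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    answer = LETTERS[options.index(meta.label)]
    return "\n".join(lines), answer
```

Driving prompts from metadata in this way keeps question wording uniform across datasets, which is presumably what makes a final expert validation pass tractable; an open-ended variant would simply omit the lettered options.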

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.