MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models
Summary
The Massive Multimodal Biomedical Understanding (MMBU) benchmark is presented as the largest biomedical vision-language benchmark to date, covering 35 submodalities across 11 modalities, 20 specimens, and 95 regions of interest drawn from 410 datasets. It evaluates 15 open-weight and 2 frontier Vision-Language Models (VLMs) on ungrounded classification, grounded classification, and object detection tasks. The findings show that medical adaptation offers modest gains for some models, but that high accuracy on established benchmarks often masks significant deficiencies in visual perception and domain generalization. Absolute F1 scores remain low: the best closed-format result is 0.693, and closed-format scores exceed open-ended scores by 0.26 on average. Object detection is a critical failure mode, with no VLM surpassing the random baseline (F1 = 0.172) in closed settings.
Key takeaway
If you are an AI scientist or machine learning engineer developing biomedical VLMs, assess model robustness beyond established benchmarks. MMBU reveals that current models struggle with fine-grained visual perception, especially in object detection and open-ended tasks, where scores stay low and, for object detection, do not exceed the random baseline. Focus your development efforts on improving spatial reasoning and on adaptation strategies that generalize consistently across diverse biomedical modalities and contexts, rather than optimizing solely for narrow, potentially contaminated datasets.
Key insights
Biomedical VLMs exhibit pervasive perceptual weaknesses, especially in object detection and open-ended tasks, despite medical adaptation.
Principles
- VLM performance degrades significantly in open-ended tasks.
- Medical adaptation offers limited, inconsistent gains.
- Object detection is a major VLM failure mode.
Method
MMBU's creation involves a four-stage, metadata-driven, human-in-the-loop pipeline: task demonstration, fine-grained metadata collection, question template construction, and expert validation for clinically grounded prompts.
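The question template construction stage can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the benchmark's actual implementation: the `ImageMetadata` fields, the `TEMPLATE` wording, and the `build_closed_question` helper are hypothetical, and the real templates go through expert validation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageMetadata:
    # Hypothetical fields, loosely mirroring the categories named in the summary.
    modality: str
    specimen: str
    region: str
    label: str


# Hypothetical template wording, not taken from the benchmark.
TEMPLATE = "This is a {modality} image of {specimen} {region}. Which finding is shown?"
LETTERS = "ABCDEFGH"


def build_closed_question(meta, distractors, rng):
    """Fill the template from metadata and shuffle the true label among distractors."""
    options = [meta.label, *distractors]
    if len(options) > len(LETTERS):
        raise ValueError("too many answer options for the available letters")
    rng.shuffle(options)  # randomize the position of the correct answer
    lines = [TEMPLATE.format(modality=meta.modality,
                             specimen=meta.specimen,
                             region=meta.region)]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    answer = LETTERS[options.index(meta.label)]
    return "\n".join(lines), answer


meta = ImageMetadata("chest X-ray", "human", "lung", "pneumonia")
prompt, answer = build_closed_question(
    meta, ["normal", "effusion", "nodule"], random.Random(0)
)
print(prompt)
print("Correct option:", answer)
```

Dropping the lettered options from the same template would give an open-ended variant of the question, which is the setting where the summary reports the larger performance drop.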
In practice
- Evaluate VLMs on diverse, fine-grained biomedical tasks.
- Prioritize spatial modeling for improved object detection.
- Develop adaptation methods that generalize beyond existing benchmarks.
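One way to act on these recommendations is to report every closed-format score next to a random-guess baseline, so that a model scoring at chance level is obvious. The sketch below assumes macro-averaged F1 over class labels and a uniform random guesser; the summary does not say how the benchmark averages F1 or constructs its baseline, so the `macro_f1` and `random_baseline_f1` helpers and the toy labels are illustrative assumptions.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def random_baseline_f1(y_true, labels, trials=200, seed=0):
    """Average macro F1 of a uniform random guesser over several trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        guesses = [rng.choice(labels) for _ in y_true]
        total += macro_f1(y_true, guesses)
    return total / trials


# Toy data, purely illustrative.
y_true = ["A", "B", "C", "D", "A", "C", "B", "D"]
y_pred = ["A", "B", "A", "D", "A", "C", "D", "D"]

model_f1 = macro_f1(y_true, y_pred)
baseline_f1 = random_baseline_f1(y_true, labels=["A", "B", "C", "D"])
print(f"model macro F1:  {model_f1:.3f}")
print(f"random baseline: {baseline_f1:.3f}")
print("beats baseline:", model_f1 > baseline_f1)
```

Running the same comparison separately per modality or task type keeps a strong aggregate score from hiding a failure in one category such as object detection.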
Topics
- Biomedical Vision-Language Models
- Multimodal Benchmarking
- Medical Imaging Analysis
- Object Detection
- Domain Generalization
- Visual Question Answering
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.