MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Biomedical Imaging & Diagnostics · Depth: Expert, extended

Summary

The Massive Multimodal Biomedical Understanding (MMBU) benchmark is introduced as the largest biomedical vision-language benchmark to date. Drawing on 410 datasets, it covers 11 modalities (35 submodalities), 20 specimens, and 95 regions of interest. It evaluates 15 open-weight and 2 frontier Vision-Language Models (VLMs) on ungrounded classification, grounded classification, and object detection tasks. The findings show that medical adaptation offers modest gains for some models, and that high accuracy on established benchmarks often masks significant deficiencies in visual perception and domain generalization. Absolute F1 scores remain low: the best closed-format result is 0.693, and performance drops by 0.26 on average between closed and open-ended tasks. Object detection is a critical failure mode, with no VLM surpassing the random baseline (F1 = 0.172) in closed settings.
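
To make the F1 figures above concrete, here is a minimal sketch of how a per-task F1 and a closed-versus-open gap could be computed. The summary does not say how MMBU averages F1 or aggregates the gap, so macro averaging and a per-model mean difference are assumptions, and the function names `macro_f1` and `closed_open_gap` are hypothetical.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Assumption: macro averaging. The source does not state the averaging mode.
    """
    labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def closed_open_gap(closed_f1, open_f1):
    """Mean per-model difference between closed-format and open-ended F1.

    Both arguments map a model name to its F1; only shared models are used.
    """
    models = closed_f1.keys() & open_f1.keys()
    return sum(closed_f1[m] - open_f1[m] for m in models) / len(models)
```

Under this reading, a positive gap means models score higher when given answer options than when answering freely, which is the direction the summary reports.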

Key takeaway

AI scientists and machine learning engineers developing biomedical VLMs should critically assess model robustness beyond established benchmarks. MMBU reveals that current models struggle with fine-grained visual perception, especially in object detection, where no evaluated VLM beats the random baseline in closed settings, and in open-ended tasks, where F1 drops by 0.26 on average. Focus development efforts on improving spatial reasoning and on adaptation strategies that generalize consistently across diverse biomedical modalities and contexts, rather than optimizing solely for narrow, potentially data-polluted datasets.

Key insights

Biomedical VLMs exhibit pervasive perceptual weaknesses, especially in object detection and open-ended tasks, despite medical adaptation.

Principles

VLM performance degrades significantly in open-ended tasks.
Medical adaptation offers limited, inconsistent gains.
Object detection is a major VLM failure mode.

Method

MMBU's creation involves a four-stage, metadata-driven, human-in-the-loop pipeline: task demonstration, fine-grained metadata collection, question template construction, and expert validation for clinically grounded prompts.
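
The question template construction stage can be pictured with a small sketch that turns one metadata record into a closed-format and an open-ended prompt. This is an illustration only: the template wording, the metadata keys (`modality`, `specimen`, `region`, `label`), and both function names are hypothetical, not taken from the MMBU pipeline.

```python
import string

# Hypothetical templates; the actual MMBU wording is not given in the summary.
CLOSED_TEMPLATE = (
    "This is a {modality} image of {specimen} ({region}). "
    "Which option best describes what is shown?"
)
OPEN_TEMPLATE = (
    "This is a {modality} image of {specimen} ({region}). "
    "What does it show?"
)


def build_closed_question(meta, distractors):
    """Return (prompt, answer_letter) for a multiple-choice question.

    Options are sorted alphabetically to keep the sketch deterministic;
    a real pipeline would more likely shuffle them.
    """
    options = sorted(set(distractors) | {meta["label"]})
    letters = string.ascii_uppercase[: len(options)]
    lines = [CLOSED_TEMPLATE.format(**meta)]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    answer = letters[options.index(meta["label"])]
    return "\n".join(lines), answer


def build_open_question(meta):
    """Return (prompt, reference_answer) with no options offered."""
    return OPEN_TEMPLATE.format(**meta), meta["label"]
```

Generating both formats from the same record is what would make a closed-versus-open comparison possible on identical images; the expert validation stage would then review the resulting prompts.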

In practice

Evaluate VLMs on diverse, fine-grained biomedical tasks.
Prioritize spatial modeling for improved object detection.
Develop adaptation methods that generalize beyond existing benchmarks.
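
For teams acting on the object detection point above, one common way to score predicted boxes is F1 after IoU-based matching. The summary does not describe MMBU's own detection scoring, so the sketch below swaps in that general convention; the 0.5 threshold, the greedy matching, and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred_boxes, gt_boxes, iou_threshold=0.5):
    """F1 for one image using greedy one-to-one matching.

    A prediction counts as a true positive when its best overlap with a
    still-unmatched ground-truth box reaches iou_threshold (assumed 0.5).
    """
    unmatched = list(range(len(gt_boxes)))
    tp = 0
    for pred in pred_boxes:
        best_idx, best_iou = None, 0.0
        for idx in unmatched:
            overlap = iou(pred, gt_boxes[idx])
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is not None and best_iou >= iou_threshold:
            unmatched.remove(best_idx)
            tp += 1
    fp = len(pred_boxes) - tp  # predictions with no matching ground truth
    fn = len(gt_boxes) - tp    # ground-truth boxes never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Tracking this score per modality, rather than only in aggregate, is one way to check whether spatial improvements generalize.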

Topics

Biomedical Vision-Language Models
Multimodal Benchmarking
Medical Imaging Analysis
Object Detection
Domain Generalization
Visual Question Answering

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.