ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Medical Devices & Health Technology, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ReXSonoVQA is a video question-answering benchmark that evaluates how well Vision-Language Models (VLMs) understand procedure-centric ultrasound scanning technique. It comprises 514 video clips from publicly available YouTube instructional videos, paired with 514 questions (249 multiple-choice, 265 free-response). The benchmark targets three core competencies (Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning) and spans six clinical categories. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro showed that VLMs can extract some procedural information, but troubleshooting questions (Type 2) remain particularly challenging, with minimal gains over text-only baselines. Performance generally improved with longer clip durations, and free-response questions depended more strongly on visual evidence than multiple-choice questions (MCQs) did.

Key takeaway

For research scientists developing perception systems for autonomous ultrasound or real-time guidance, ReXSonoVQA provides a benchmark for assessing dynamic procedural understanding. Prioritize VLMs with robust causal reasoning, especially for troubleshooting scenarios (Type 2 questions), where current models show significant limitations. The benchmark's emphasis on video-informed free-response questions highlights the need for models that use visual evidence rather than relying on textual cues.

Key insights

ReXSonoVQA evaluates VLM understanding of dynamic ultrasound procedures, revealing limitations in causal troubleshooting reasoning.

Principles

Dynamic procedural understanding requires temporal and causal reasoning.
Longer video clips provide richer context for procedural reasoning.
Free-response questions show stronger dependence on visual evidence.

Method

ReXSonoVQA's construction pipeline has four stages: task definition, data curation from YouTube videos, LLM-assisted creation of ground-truth event logs, and an iterative quality-control loop that combines blind solvability screening with distractor refinement.
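
The blind solvability screening step can be illustrated with a minimal sketch. The summary does not state the criterion the authors used, so everything below is an assumption: the `MCQ` record, the function names, and the rule "flag a question when text-only solvers beat chance by more than a margin" are hypothetical and chosen only to show the idea of filtering out questions that can be answered without watching the video.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    """Hypothetical multiple-choice record; not the benchmark's actual schema."""
    qid: str
    options: tuple[str, ...]  # option labels, e.g. ("A", "B", "C", "D")
    answer: str               # label of the correct option


def blind_solve_rate(question: MCQ, blind_answers: list[str]) -> float:
    """Fraction of text-only (no video) attempts that picked the correct option."""
    if not blind_answers:
        return 0.0
    hits = sum(1 for a in blind_answers if a == question.answer)
    return hits / len(blind_answers)


def needs_revision(question: MCQ, blind_answers: list[str], margin: float = 0.25) -> bool:
    """Flag a question whose blind solve rate exceeds chance by more than `margin`.

    The margin is an assumed threshold, not a value taken from the paper.
    """
    chance = 1.0 / len(question.options)
    return blind_solve_rate(question, blind_answers) > chance + margin


# Toy example with made-up answers (not benchmark data).
q = MCQ(qid="demo-1", options=("A", "B", "C", "D"), answer="B")
leaky = needs_revision(q, ["B", "B", "B", "A"])   # 0.75 > 0.25 + 0.25
robust = needs_revision(q, ["A", "C", "B", "D"])  # 0.25 is not > 0.50
```

In a loop like the one described, a flagged question would go back for distractor refinement and then be screened again.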

In practice

Use ReXSonoVQA to benchmark VLMs for ultrasound training and automation.
Focus VLM development on improving causal troubleshooting reasoning.
Prioritize native video input for VLMs in medical imaging tasks.
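
To act on these points when benchmarking, one option is to report, for each question type, how much accuracy a model gains from video input over its own text-only baseline. The sketch below assumes results are available as `(question_type, is_correct)` pairs; that record format, the function names, and the toy values are illustrative assumptions, not the repository's API or reported results.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Map each question type to its accuracy.

    `records` is an iterable of (question_type, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for qtype, is_correct in records:
        totals[qtype] += 1
        correct[qtype] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


def video_gain(video_records, text_only_records):
    """Per-type accuracy difference: video-input run minus text-only run."""
    video_acc = accuracy_by_type(video_records)
    text_acc = accuracy_by_type(text_only_records)
    return {t: video_acc[t] - text_acc[t] for t in video_acc if t in text_acc}


# Toy placeholder outcomes (not benchmark results).
video_run = [("type1", True), ("type1", True), ("type2", True), ("type2", False)]
text_run = [("type1", False), ("type1", True), ("type2", True), ("type2", False)]

gains = video_gain(video_run, text_run)
# gains == {"type1": 0.5, "type2": 0.0}
```

A gain near zero for a question type suggests the model is answering from textual cues alone, which is the pattern the summary describes for Type 2 troubleshooting questions.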

Topics

ReXSonoVQA
Video Question Answering
Ultrasound Imaging
Vision-Language Models
Procedural Understanding

Code references

rajpurkarlab/RexSonoVQA

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.