ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
Summary
ReXSonoVQA is a new video-based question-answering benchmark that evaluates how well Vision-Language Models (VLMs) understand procedure-centric ultrasound scanning techniques. It comprises 514 video clips from publicly available YouTube instructional videos, paired with 514 questions (249 multiple-choice, 265 free-response). The benchmark targets three core competencies, Action-Goal Reasoning (Type 1), Artifact Resolution & Optimization (Type 2), and Procedure Context & Planning (Type 3), across six clinical categories. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro showed that VLMs can extract some procedural information from video, but troubleshooting questions (Type 2) remain particularly challenging, with minimal gains over text-only baselines. Performance generally improved with longer clip durations, and free-response questions depended more strongly on visual evidence than multiple-choice questions (MCQs) did.
Key takeaway
For research scientists developing perception systems for autonomous ultrasound or real-time guidance, ReXSonoVQA provides a benchmark for assessing dynamic procedural understanding. You should prioritize developing VLMs with robust causal reasoning capabilities, especially for troubleshooting scenarios (Type 2 questions), where current models gain little over text-only baselines. The benchmark's emphasis on video-informed free-response questions highlights the need for models that genuinely use visual evidence rather than textual cues.
Key insights
ReXSonoVQA evaluates VLM understanding of dynamic ultrasound procedures, revealing limitations in causal troubleshooting reasoning.
Principles
- Dynamic procedural understanding requires temporal and causal reasoning.
- Longer video clips provide richer context for procedural reasoning.
- Free-response questions show stronger dependence on visual evidence.
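One way to quantify "dependence on visual evidence" is the accuracy gap between a model run with video and the same model run on text alone, computed per question format. The sketch below is illustrative only: the record layout and the names `video_gain`, `correct_video`, and `correct_text_only` are assumptions, not part of the benchmark's published tooling, and free-response correctness is assumed to have already been reduced to a boolean by some grader.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def video_gain(records: Iterable[Mapping]) -> Dict[str, float]:
    """Per-format accuracy gain of the video run over the text-only run.

    Each record is assumed (hypothetically) to look like:
        {"format": "mcq" | "free",
         "correct_video": bool,       # graded answer when the model sees the clip
         "correct_text_only": bool}   # graded answer from question text alone
    """
    # totals[format] = [question count, correct with video, correct text-only]
    totals = defaultdict(lambda: [0, 0, 0])
    for rec in records:
        bucket = totals[rec["format"]]
        bucket[0] += 1
        bucket[1] += int(rec["correct_video"])
        bucket[2] += int(rec["correct_text_only"])
    # Gain = accuracy(video) - accuracy(text-only), per format.
    return {fmt: (vid - txt) / n for fmt, (n, vid, txt) in totals.items()}
```

A gain near zero for a format or question type suggests the answers are recoverable from the text, which is the pattern the summary reports for troubleshooting questions.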
Method
ReXSonoVQA's construction pipeline has four stages: task definition, data curation from YouTube videos, LLM-assisted creation of ground-truth event logs, and an iterative quality-control loop that combines blind solvability screening with distractor refinement.
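The summary does not say how blind solvability screening is implemented. Assuming it means checking whether a multiple-choice question can be answered without the video, a minimal sketch might flag items that a text-only solver gets right on every attempt, so their distractors can be refined. All names here (`MCQ`, `blind_screen`, `longest_option_solver`) are hypothetical, and the toy solver stands in for a real text-only model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MCQ:
    question: str
    options: List[str]
    answer_index: int  # index of the correct option


# A text-only solver sees the question and options, never the video.
Solver = Callable[[str, List[str]], int]


def longest_option_solver(question: str, options: List[str]) -> int:
    """Toy stand-in for a text-only model: always guess the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


def blind_screen(
    items: List[MCQ], solver: Solver, n_trials: int = 3
) -> Tuple[List[MCQ], List[MCQ]]:
    """Split items into (kept, flagged).

    An item is flagged when the solver picks the correct option in every
    trial, i.e. it looks solvable without watching the clip.
    """
    kept, flagged = [], []
    for item in items:
        hits = sum(
            solver(item.question, item.options) == item.answer_index
            for _ in range(n_trials)
        )
        (flagged if hits == n_trials else kept).append(item)
    return kept, flagged
```

Flagged items would then go back through distractor refinement and be screened again, which matches the iterative loop described above; the all-trials threshold is an arbitrary choice for this sketch.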
In practice
- Use ReXSonoVQA to benchmark VLMs for ultrasound training and automation.
- Focus VLM development on improving causal troubleshooting reasoning.
- Prioritize native video input for VLMs in medical imaging tasks.
Topics
- ReXSonoVQA
- Video Question Answering
- Ultrasound Imaging
- Vision-Language Models
- Procedural Understanding
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.