GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Medical Imaging AI · Depth: Expert, extended

Summary

GAZE (Grounded Agentic Zero-shot Evaluation) is a framework that lets medical Vision-Language Models (VLMs) analyze brain MRI images iteratively, mimicking a radiologist's workflow. Unlike single-pass VLM pipelines, GAZE gives the model viewer-level tools (e.g., zoom, windowing, contrast, edge detection) and two retrieval tools linked to the U.S. National Library of Medicine (PubMed for literature, Open-i for radiological images). The framework uses structured prompting and schema-validated outputs, and it records full tool-call traces for auditability. On the NOVA benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE achieved 58.2 mean average precision (mAP) at an IoU threshold of 0.3 for lesion localization and 34.9% Top-1 diagnostic accuracy under a joint evaluation protocol. The framework itself, through structured prompting, improved the Gemini 2.0 Flash baseline from 20.2 to 29.4 mAP@0.3. Tool use disproportionately benefited rare pathologies: for conditions with three or fewer examples, the share of cases with IoU > 0.3 rose from 17% to 58%.
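To make the localization threshold concrete, here is a minimal sketch of the IoU test that underlies a metric like mAP@0.3. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form and a strict `>` comparison; both are illustrative assumptions, not details confirmed by the summary, and the snippet does not compute full mAP (which also involves ranking predictions by confidence).

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, truth: Box, threshold: float = 0.3) -> bool:
    """True when the predicted box overlaps the reference box above the threshold.

    The strict comparison mirrors the "IoU > 0.3" wording; whether the
    benchmark treats the boundary as inclusive is an assumption here.
    """
    return iou(pred, truth) > threshold
```

For example, a prediction covering exactly the top half of a reference box has IoU 0.5 and counts as a hit at this threshold, while two equal squares overlapping in one quarter of their area have IoU of about 0.14 and do not.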

Key takeaway

For computer vision engineers developing medical AI, GAZE demonstrates that combining iterative tool use with external knowledge retrieval substantially improves VLM performance on complex tasks such as rare brain MRI diagnosis. Prioritize joint evaluation protocols that score captioning, diagnosis, and localization simultaneously, because they reveal critical trade-offs that single-task evaluations miss, such as retrieval-induced localization degradation. Consider adopting structured prompting and schema-validated outputs to enhance model reliability and auditability in clinical applications.

Key insights

Agentic VLMs with iterative tool use and retrieval enhance medical image interpretation, especially for rare conditions.

Principles

Method

GAZE uses a VLM agent to iteratively call viewer-level visual tools and PubMed/Open-i retrieval, validating structured outputs against a JSON schema and logging all tool interactions for auditability.
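The two auditability mechanisms in this description, schema validation and tool-call logging, can be sketched with the Python standard library alone. Everything named here is hypothetical: the field names (`caption`, `diagnosis`, `boxes`), the `ToolTrace` class, and the `zoom` tool are illustrative assumptions, not GAZE's actual schema or API, and a real system would more likely validate against a full JSON Schema.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical output schema: field name -> required Python type.
REQUIRED_FIELDS = {"caption": str, "diagnosis": str, "boxes": list}


def validate_output(raw: str) -> Dict[str, Any]:
    """Parse the model's JSON reply and reject it if it breaks the schema."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"wrong type for field: {name}")
    for box in data["boxes"]:
        # Each box is assumed to be four numbers: x_min, y_min, x_max, y_max.
        ok = (
            isinstance(box, list)
            and len(box) == 4
            and all(isinstance(v, (int, float)) for v in box)
        )
        if not ok:
            raise ValueError("each box must be a list of four numbers")
    return data


@dataclass
class ToolTrace:
    """Records every tool invocation so a run can be audited afterwards."""

    calls: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, name: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
        result = tool(**kwargs)
        self.calls.append({"tool": name, "args": kwargs, "result": result})
        return result


# Usage with a stand-in tool (a real viewer tool would return an image).
trace = ToolTrace()
trace.call("zoom", lambda factor: f"zoomed x{factor}", factor=2)
reply = '{"caption": "c", "diagnosis": "d", "boxes": [[1, 2, 3, 4]]}'
parsed = validate_output(reply)
```

Routing every tool through a single `call` method keeps the trace complete by construction, and rejecting malformed replies before scoring means downstream metrics never see free-form text.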

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.