GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI
Summary
GAZE (Grounded Agentic Zero-shot Evaluation) is a framework that lets medical Vision-Language Models (VLMs) analyze brain MRI images iteratively, mimicking a radiologist's workflow. Unlike single-pass VLMs, GAZE gives the model viewer-level tools (e.g., zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine: PubMed for literature and Open-i for radiological images. The framework uses structured prompting and schema-validated outputs, and records full tool-call traces for auditability. On the NOVA benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE achieved 58.2 mean average precision (mAP) at IoU 0.3 for lesion localization and 34.9% Top-1 diagnostic accuracy under a joint evaluation protocol. Structured prompting alone, before any tool use, improved the Gemini 2.0 Flash baseline from 20.2 to 29.4 mAP@0.3. Tool use disproportionately benefited rare pathologies: for conditions with three or fewer examples, the share of cases with IoU > 0.3 rose from 17% to 58%.
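The localization figures above depend on an intersection-over-union (IoU) threshold of 0.3. As a minimal sketch of what that threshold means, the snippet below computes IoU for two axis-aligned boxes and checks it against 0.3. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the inclusive comparison are assumptions for illustration, not details taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, truth, threshold=0.3):
    """True when the predicted box overlaps the ground truth enough to count.

    Assumption: the boundary is inclusive (IoU >= threshold).
    """
    return box_iou(pred, truth) >= threshold
```

A threshold of 0.3 is more lenient than the 0.5 often used in object detection, which may suit small or poorly delimited lesions.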
Key takeaway
For Computer Vision Engineers developing medical AI, GAZE demonstrates that integrating iterative tool use and external knowledge retrieval significantly improves VLM performance on complex tasks like rare brain MRI diagnosis. You should prioritize joint evaluation protocols that score captioning, diagnosis, and localization simultaneously, as this reveals critical trade-offs, such as retrieval-induced localization degradation, that single-task evaluations miss. Consider adopting structured prompting and schema-validated outputs to enhance model reliability and auditability in clinical applications.
Key insights
Agentic VLMs with iterative tool use and retrieval enhance medical image interpretation, especially for rare conditions.
Principles
- Medical VLM evaluation needs joint scoring of captioning, diagnosis, and localization.
- Framework design matters on its own: structured prompting alone lifted the Gemini 2.0 Flash baseline from 20.2 to 29.4 mAP@0.3.
- Tool benefit correlates with model engagement and ability to integrate diverse evidence.
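The first principle can be illustrated with a small joint report that scores all three tasks on the same cases and flags any metric that got worse between two configurations. This is only a sketch under stated assumptions: `CaseResult`, its fields, the caption score in [0, 1], and the `regressions` helper are hypothetical and do not reproduce the paper's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    caption_score: float  # hypothetical caption metric in [0, 1]
    top1_correct: bool    # was the top-ranked diagnosis correct?
    iou: float            # IoU of predicted vs. ground-truth lesion box


def joint_report(results, iou_threshold=0.3):
    """Score captioning, diagnosis, and localization on the same cases."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    return {
        "caption_mean": sum(r.caption_score for r in results) / n,
        "top1_accuracy": sum(r.top1_correct for r in results) / n,
        "localized_rate": sum(r.iou >= iou_threshold for r in results) / n,
    }


def regressions(before, after, tolerance=0.0):
    """Names of metrics that are worse in `after` than in `before`."""
    return [k for k in before if after[k] < before[k] - tolerance]


# Toy numbers, invented purely to show the mechanics.
no_retrieval = [CaseResult(0.5, False, 0.4), CaseResult(0.5, True, 0.5)]
with_retrieval = [CaseResult(0.5, True, 0.2), CaseResult(0.5, True, 0.5)]
worse = regressions(joint_report(no_retrieval), joint_report(with_retrieval))
```

In this toy case diagnosis improves while localization drops, which is the kind of trade-off a single-task evaluation would hide.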
Method
GAZE uses a VLM agent to iteratively call viewer-level visual tools and PubMed/Open-i retrieval, validating structured outputs against a JSON schema and logging all tool interactions for auditability.
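The loop described above might look roughly like the following. This is a hedged sketch, not the authors' implementation: the action format, the `model_step` callable, the three-field output schema, and the hand-rolled validator (standing in for a real JSON Schema check) are all assumptions made for illustration.

```python
import json

# Assumed output schema: field name -> required Python type.
REQUIRED_FIELDS = {"caption": str, "diagnosis": str, "bbox": list}


def validate_output(payload):
    """Return a list of schema problems; an empty list means valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            problems.append(f"wrong type for {name}")
    bbox = payload.get("bbox")
    if isinstance(bbox, list) and (
        len(bbox) != 4 or not all(isinstance(v, (int, float)) for v in bbox)
    ):
        problems.append("bbox must be a list of 4 numbers")
    return problems


def run_agent(model_step, tools, image, max_steps=8):
    """Alternate model decisions and tool calls until a valid answer appears.

    `model_step(image, observation, trace)` is a hypothetical callable that
    returns either {"type": "tool_call", "name": ..., "args": {...}} or
    {"type": "final", "output": {...}}.
    """
    trace = []          # every tool interaction is recorded for auditing
    observation = None  # result of the previous step, fed back to the model
    for _ in range(max_steps):
        action = model_step(image, observation, trace)
        kind = action.get("type")
        if kind == "tool_call":
            name, args = action["name"], action.get("args", {})
            if name in tools:
                observation = tools[name](image, **args)
            else:
                observation = {"error": f"unknown tool: {name}"}
            trace.append({"tool": name, "args": args, "result": observation})
        elif kind == "final":
            problems = validate_output(action["output"])
            if not problems:
                return action["output"], trace
            # Feed schema errors back so the model can repair its answer.
            observation = {"schema_errors": problems}
            trace.append({"tool": "schema_check", "args": {}, "result": observation})
        else:
            observation = {"error": f"unknown action type: {kind}"}
    raise RuntimeError("no schema-valid answer within max_steps")


def dump_trace(trace):
    """Serialize the trace; assumes tool results are JSON-serializable."""
    return json.dumps(trace, indent=2)
```

Recording rejected answers in the same trace as tool calls keeps the audit log complete, at the cost of mixing two kinds of events in one list.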
In practice
- Implement schema-constrained outputs for medical VLM tasks.
- Integrate viewer-level tools for iterative image analysis.
- Evaluate retrieval systems with multi-task metrics to detect trade-offs.
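To make "viewer-level tools" concrete, here is a minimal sketch of two of them, intensity windowing and a crop used as a simple zoom, written in plain Python over a 2D list of intensities. The function names, signatures, and the 0 to 255 output range are assumptions for illustration; the source does not specify GAZE's tool interfaces, and a real pipeline would likely use an array library.

```python
def apply_window(pixels, center, width):
    """Rescale intensities inside a window to 0..255, clipping the rest.

    The window spans [center - width / 2, center + width / 2].
    """
    if width <= 0:
        raise ValueError("width must be positive")
    low = center - width / 2
    windowed = []
    for row in pixels:
        # Normalize to [0, 1], clip, then scale to 8-bit display range.
        windowed.append(
            [round(255 * min(1.0, max(0.0, (v - low) / width))) for v in row]
        )
    return windowed


def crop(pixels, top, left, height, width):
    """Return the sub-image starting at (top, left); a stand-in for zoom."""
    return [row[left:left + width] for row in pixels[top:top + height]]
```

Narrowing the window spreads a small intensity range over the full display range, which is how a reader brings out subtle contrast differences. The window values an agent would pick for a given MRI sequence are not given in the source.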
Topics
- GAZE Framework
- Medical Vision-Language Models
- Rare Brain MRI
- Agentic Systems
- Viewer-Level Tools
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.