GAZE: Grounded Agentic Zero-shot Evaluation with Viewer-Level Tools and Literature Retrieval on Rare Brain MRI

2026-05-05 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Medical Imaging AI · Depth: Expert, extended

Summary

GAZE (Grounded Agentic Zero-shot Evaluation) is a framework that lets medical Vision-Language Models (VLMs) analyze brain MRI images iteratively, in the way a radiologist works through a study. Where conventional VLMs answer in a single pass, GAZE gives the model viewer-level tools (e.g., zoom, windowing, contrast, edge detection) and two retrieval tools backed by the U.S. National Library of Medicine: PubMed for literature and Open-i for radiological images. The framework uses structured prompting and schema-validated outputs, and it records full tool-call traces for auditability. On the NOVA benchmark of 906 brain MRI cases covering 281 rare neurological conditions, GAZE achieved 58.2 mean average precision (mAP) at IoU 0.3 for lesion localization and 34.9% Top-1 diagnostic accuracy under a joint evaluation protocol. The framework itself, through structured prompting, improved the Gemini 2.0 Flash baseline from 20.2 to 29.4 mAP@0.3. Tool use disproportionately benefited rare pathologies: for conditions with three or fewer examples, the share of cases reaching IoU above 0.3 rose from 17% to 58%.
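
The localization figures above rest on an intersection-over-union (IoU) criterion at a 0.3 threshold. As a rough illustration only, the sketch below computes IoU for two axis-aligned boxes and checks the threshold. The `(x_min, y_min, x_max, y_max)` box convention and the names `box_iou` and `is_hit` are assumptions made for this example, not details taken from the paper. Note also that mAP averages precision over ranked predictions, so this shows the matching criterion, not the full metric.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap along each axis; zero when the boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


def is_hit(pred: Box, truth: Box, threshold: float = 0.3) -> bool:
    """True when the predicted box matches the ground truth at the threshold."""
    return box_iou(pred, truth) >= threshold
```

A threshold of 0.3 is more lenient than the 0.5 commonly used in object detection, which may suit small or diffuse lesions whose boundaries are hard to delineate.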

Key takeaway

For computer vision engineers developing medical AI, GAZE shows that combining iterative tool use with external knowledge retrieval improves VLM performance on hard tasks such as rare brain MRI diagnosis. Prioritize joint evaluation protocols that score captioning, diagnosis, and localization together, because they expose trade-offs, such as retrieval degrading localization, that single-task evaluations miss. Consider adopting structured prompting and schema-validated outputs to improve model reliability and auditability in clinical applications.

Key insights

Agentic VLMs with iterative tool use and retrieval enhance medical image interpretation, especially for rare conditions.

Principles

Medical VLM evaluation needs joint scoring of captioning, diagnosis, and localization.
Framework design, including prompting and schema validation, significantly impacts VLM performance.
Tools help most when the model actively engages with them and can integrate the varied evidence they return.
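
To make the first principle concrete, here is a minimal sketch of a joint report that scores the same cases on all three tasks and flags any metric that drops between two configurations. The `CaseResult` fields, the `summarize` and `regressions` helpers, and the toy records are all hypothetical. The numbers are invented purely to exercise the code and are not results from the paper, and the localization hit rate is a simplification of mAP.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    caption_score: float      # assumed to lie in [0, 1]
    diagnosis_correct: bool   # Top-1 diagnosis matches the label
    iou: float                # IoU of predicted vs. ground-truth box


def summarize(results: List[CaseResult], iou_threshold: float = 0.3) -> Dict[str, float]:
    """Average all three task metrics over the same set of cases."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "caption": sum(r.caption_score for r in results) / n,
        "top1": sum(r.diagnosis_correct for r in results) / n,
        "loc_hit_rate": sum(r.iou >= iou_threshold for r in results) / n,
    }


def regressions(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """Names of metrics that got worse from `before` to `after`."""
    return sorted(k for k in before if after[k] < before[k])


# Toy records, invented for illustration only.
without_retrieval = [CaseResult(0.5, False, 0.4), CaseResult(0.5, True, 0.5)]
with_retrieval = [CaseResult(0.5, True, 0.2), CaseResult(0.5, True, 0.5)]

dropped = regressions(summarize(without_retrieval), summarize(with_retrieval))
```

In this toy data, diagnosis accuracy improves while the localization hit rate falls, so `dropped` contains `"loc_hit_rate"`. A diagnosis-only evaluation would report the change as a pure gain.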

Method

GAZE uses a VLM agent to iteratively call viewer-level visual tools and PubMed/Open-i retrieval, validating structured outputs against a JSON schema and logging all tool interactions for auditability.
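
The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the action format, the `run_agent` and `validate_output` names, the `REQUIRED_FIELDS` layout, and the scripted policy standing in for the VLM are all hypothetical. A production system would more likely validate against a real JSON Schema than use the hand-rolled type check shown here.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical output layout: field name -> required Python type.
REQUIRED_FIELDS: Dict[str, type] = {"caption": str, "diagnosis": str, "bbox": list}


def validate_output(raw: str) -> Dict[str, Any]:
    """Parse a JSON string and reject it unless it matches REQUIRED_FIELDS."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    bbox = data["bbox"]
    if len(bbox) != 4 or not all(isinstance(v, (int, float)) for v in bbox):
        raise ValueError("bbox must hold exactly four numbers")
    return data


def run_agent(
    policy: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
    tools: Dict[str, Callable[..., Any]],
    max_steps: int = 8,
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Alternate between policy decisions and tool calls, logging each call."""
    trace: List[Dict[str, Any]] = []
    for _ in range(max_steps):
        action = policy(trace)  # the VLM would decide here, given the trace so far
        if action["type"] == "final":
            return validate_output(action["content"]), trace
        result = tools[action["tool"]](**action["args"])
        # Every tool interaction is recorded so the run can be audited later.
        trace.append({"tool": action["tool"], "args": action["args"], "result": result})
    raise RuntimeError("no final answer within the step budget")


def scripted_policy(trace: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stand-in for the VLM: call one tool, then emit a final answer."""
    if not trace:
        return {"type": "tool", "tool": "zoom", "args": {"factor": 2}}
    answer = {"caption": "placeholder caption",
              "diagnosis": "placeholder diagnosis",
              "bbox": [10, 20, 30, 40]}
    return {"type": "final", "content": json.dumps(answer)}
```

Keeping validation at the exit of the loop means a malformed answer fails loudly instead of flowing into the scoring stage, and the returned trace gives a step-by-step record of which tools were called with which arguments.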

In practice

Implement schema-constrained outputs for medical VLM tasks.
Integrate viewer-level tools for iterative image analysis.
Evaluate retrieval systems with multi-task metrics to detect trade-offs.
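
As a small illustration of what viewer-level tools might look like when exposed to an agent, the sketch below implements intensity windowing and a crop-style zoom on a 2D image held as nested lists. The function names, signatures, and the list-of-lists image representation are assumptions chosen to keep the example dependency-free. The source does not describe how GAZE implements its tools, and a real pipeline would typically operate on arrays.

```python
from typing import List

# Assumed representation: a 2D grayscale image as rows of pixel intensities.
Image = List[List[float]]


def apply_window(image: Image, center: float, width: float) -> Image:
    """Linearly map [center - width/2, center + width/2] to [0, 1], clipping outside."""
    if width <= 0:
        raise ValueError("window width must be positive")
    low = center - width / 2
    return [[min(1.0, max(0.0, (p - low) / width)) for p in row] for row in image]


def zoom_crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the requested sub-region without resampling."""
    return [row[left:left + width] for row in image[top:top + height]]


# Toy 2x3 image, invented for illustration.
toy = [[0, 50, 100], [150, 200, 250]]
windowed = apply_window(toy, center=100, width=100)
region = zoom_crop(toy, top=0, left=1, height=2, width=2)
```

Narrowing the window stretches contrast inside a chosen intensity band, and cropping lets the model inspect a suspected region at higher effective resolution, which are the kinds of adjustments a radiologist makes interactively in a viewer.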

Topics

GAZE Framework
Medical Vision-Language Models
Rare Brain MRI
Agentic Systems
Viewer-Level Tools

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.