Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
Summary
Researchers have introduced PlantInquiryVQA, a new benchmark and dataset designed to evaluate Multimodal Large Language Models (MLLMs) on multi-step, intent-driven visual reasoning for botanical disease diagnosis. Unlike existing Visual Question Answering (VQA) benchmarks that focus on single-turn queries, PlantInquiryVQA formalizes a "Chain of Inquiry" (CoI) framework, mimicking how botanists use structured, adaptive questioning based on visual cues and diagnostic intent. The dataset comprises 24,950 expert-curated plant images and 138,068 question-answer pairs, annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations of top-tier MLLMs, including Gemini-3-Flash and Grok-4.1-Fast, revealed that while models describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Structured question-guided inquiry significantly improved diagnostic correctness, reduced hallucination, and increased reasoning efficiency, with Gemini-3-Flash achieving the highest overall performance but still showing a substantial domain gap in clinical utility.
Key takeaway
For Computer Vision Engineers developing diagnostic AI, this research suggests building structured, intent-driven Chain-of-Inquiry (CoI) into both MLLM evaluation and system design. Move beyond single-turn VQA toward systems that perform adaptive, multi-step reasoning: the authors report that this improves diagnostic accuracy and reduces critical errors such as false reassurance. Focus on bridging the gap between raw visual grounding and coherent clinical diagnosis to build safer, more reliable AI agents for real-world applications.
Key insights
Structured, intent-driven inquiry significantly enhances MLLM diagnostic accuracy and reasoning efficiency in botanical pathology.
Principles
- Diagnostic reasoning benefits from sequential, adaptive questioning.
- Visual grounding is distinct from clinical diagnosis.
- False reassurance (telling the user a diseased plant is fine) is the most critical error in phytopathology.
Method
The Chain of Inquiry (CoI) framework models diagnostic trajectories as ordered question-answer sequences, conditioned on grounded visual cues and explicit epistemic intent (Diagnosis, Prognosis, Management).
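One way to picture such a trajectory is as a small data structure. The sketch below is illustrative only and is not the paper's schema: the class names (`Intent`, `InquiryStep`, `InquiryChain`), the field names, and the sample questions and answers are all assumptions made up for this example. It only shows how ordered question-answer turns could carry an explicit intent and a list of grounded visual cues.

```python
from dataclasses import dataclass, field
from enum import Enum


class Intent(Enum):
    # The three epistemic intents named in the summary.
    DIAGNOSIS = "diagnosis"
    PROGNOSIS = "prognosis"
    MANAGEMENT = "management"


@dataclass
class InquiryStep:
    # One question-answer turn (hypothetical field names).
    question: str
    answer: str
    intent: Intent
    visual_cues: list[str] = field(default_factory=list)


@dataclass
class InquiryChain:
    # An ordered sequence of turns about a single image.
    image_id: str
    steps: list[InquiryStep] = field(default_factory=list)

    def add(self, step: InquiryStep) -> None:
        self.steps.append(step)

    def by_intent(self, intent: Intent) -> list[InquiryStep]:
        # Keep only the turns asked with the given intent, in order.
        return [s for s in self.steps if s.intent is intent]

    def context_for_next(self) -> str:
        # Render earlier turns as text so the next question can be
        # conditioned on everything asked and answered so far.
        return "\n".join(
            f"[{s.intent.value}] Q: {s.question}\nA: {s.answer}"
            for s in self.steps
        )


# Illustrative usage with invented content.
chain = InquiryChain(image_id="img-001")
chain.add(InquiryStep("What lesions are visible?", "Brown concentric spots",
                      Intent.DIAGNOSIS, ["leaf spot"]))
chain.add(InquiryStep("How should it be treated?", "Remove affected leaves",
                      Intent.MANAGEMENT))
print(chain.context_for_next())
```

Keeping the intent on each step, rather than on the chain as a whole, lets a single trajectory move from diagnosis to prognosis to management while remaining an ordered sequence.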
In practice
- Use question-guided protocols to reduce MLLM hallucination.
- Prioritize safety metrics in high-stakes diagnostic AI.
- Implement multi-turn reasoning for complex visual tasks.
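The points above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's evaluation protocol: `ask_model` is a hypothetical callable standing in for whatever MLLM client you use, the prompt format is invented, and `false_reassurance_rate` uses one plausible definition (the share of truly diseased cases that the model labels healthy) that may differ from the metric the authors report.

```python
from typing import Callable, Sequence


def run_guided_inquiry(
    ask_model: Callable[[str, str], str],  # hypothetical: (image_id, prompt) -> answer
    image_id: str,
    questions: Sequence[str],
) -> list[tuple[str, str]]:
    # Ask the questions in order, feeding earlier turns back as context
    # so each answer is conditioned on the inquiry so far.
    history: list[tuple[str, str]] = []
    for question in questions:
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
        prompt = f"{context}\nQ: {question}" if context else f"Q: {question}"
        history.append((question, ask_model(image_id, prompt)))
    return history


def false_reassurance_rate(
    predicted: Sequence[str],
    actual: Sequence[str],
    healthy_label: str = "healthy",
) -> float:
    # Assumed definition: among cases that are actually diseased,
    # the fraction the model predicted as healthy.
    diseased = [(p, a) for p, a in zip(predicted, actual) if a != healthy_label]
    if not diseased:
        return 0.0
    missed = sum(1 for p, _ in diseased if p == healthy_label)
    return missed / len(diseased)
```

Tracking a safety metric like this separately from overall accuracy matters because a model can score well on average while still missing the cases where a wrong "healthy" verdict does the most harm.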
Topics
- PlantInquiryVQA
- Multimodal Large Language Models
- Chain of Inquiry Framework
- Botanical Diagnosis
- Visual Question Answering
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.