Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

2026-04-24 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Crop Science & Plant Technology · Depth: Expert, extended

Summary

Researchers have introduced PlantInquiryVQA, a new benchmark and dataset designed to evaluate Multimodal Large Language Models (MLLMs) on multi-step, intent-driven visual reasoning for botanical disease diagnosis. Unlike existing Visual Question Answering (VQA) benchmarks that focus on single-turn queries, PlantInquiryVQA formalizes a "Chain of Inquiry" (CoI) framework, mimicking how botanists use structured, adaptive questioning based on visual cues and diagnostic intent. The dataset comprises 24,950 expert-curated plant images and 138,068 question-answer pairs, annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations of top-tier MLLMs, including Gemini-3-Flash and Grok-4.1-Fast, revealed that while models describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Structured question-guided inquiry significantly improved diagnostic correctness, reduced hallucination, and increased reasoning efficiency, with Gemini-3-Flash achieving the highest overall performance but still showing a substantial domain gap in clinical utility.

Key takeaway

For computer vision engineers building diagnostic AI, the practical lesson is to build structured, intent-driven Chain-of-Inquiry (CoI) into both MLLM evaluation and system design. Move beyond single-turn VQA toward systems that perform adaptive, multi-step reasoning, which in this study significantly improved diagnostic accuracy and reduced critical errors such as false reassurance. Focus on closing the gap between raw visual grounding and coherent clinical diagnosis to build safer, more reliable AI agents for real-world applications.

Key insights

Structured, intent-driven inquiry significantly enhances MLLM diagnostic accuracy and reasoning efficiency in botanical pathology.

Principles

Diagnostic reasoning benefits from sequential, adaptive questioning.
Visual grounding is distinct from clinical diagnosis.
False reassurance is the most critical error in phytopathology.

Method

The Chain of Inquiry (CoI) framework models diagnostic trajectories as ordered question-answer sequences, conditioned on grounded visual cues and explicit epistemic intent (Diagnosis, Prognosis, Management).
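One way to picture such a trajectory is as a small data structure. The sketch below is illustrative only and is not taken from the PlantInquiryVQA repository: the class names (`Intent`, `VisualCue`, `InquiryStep`, `ChainOfInquiry`), the field names, and the bounding-box layout are all assumptions. It shows an ordered list of question-answer steps, each tagged with one of the three intents and the visual cues it relies on, plus a helper that returns the earlier steps a later question would be conditioned on.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Intent(Enum):
    """The three epistemic intents named in the summary."""
    DIAGNOSIS = "diagnosis"
    PROGNOSIS = "prognosis"
    MANAGEMENT = "management"


@dataclass
class VisualCue:
    """A grounded visual observation (hypothetical schema)."""
    label: str                        # e.g. "brown lesion"
    bbox: Tuple[int, int, int, int]   # assumed (x, y, width, height) in pixels


@dataclass
class InquiryStep:
    """One question-answer turn, tagged with its intent and supporting cues."""
    intent: Intent
    question: str
    answer: str
    cues: List[VisualCue] = field(default_factory=list)


@dataclass
class ChainOfInquiry:
    """An ordered diagnostic trajectory for a single image."""
    image_id: str
    steps: List[InquiryStep] = field(default_factory=list)

    def add(self, step: InquiryStep) -> None:
        self.steps.append(step)

    def history(self, upto: int) -> List[Tuple[str, str]]:
        """Return the (question, answer) pairs that precede step index `upto`."""
        return [(s.question, s.answer) for s in self.steps[:upto]]

    def intents(self) -> List[Intent]:
        """Return the intent of each step, in order."""
        return [s.intent for s in self.steps]


if __name__ == "__main__":
    # Made-up example values, not drawn from the dataset.
    chain = ChainOfInquiry(image_id="example_image")
    chain.add(InquiryStep(
        Intent.DIAGNOSIS,
        "What symptoms are visible on the leaf?",
        "Brown lesions with yellow margins.",
        [VisualCue("brown lesion", (40, 60, 32, 28))],
    ))
    chain.add(InquiryStep(
        Intent.MANAGEMENT,
        "Given those lesions, what action is appropriate?",
        "Remove affected leaves and monitor for spread.",
    ))
    print(chain.history(1))
```

Keeping the intent on every step, rather than on the chain as a whole, lets a single trajectory move from diagnosis to prognosis to management while each answer stays traceable to the cues that support it.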

In practice

Use question-guided protocols to reduce MLLM hallucination.
Prioritize safety metrics in high-stakes diagnostic AI.
Implement multi-turn reasoning for complex visual tasks.
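As a concrete instance of a safety metric, the sketch below computes a false reassurance rate. The definition used here is an assumption made for illustration (the share of truly diseased cases that a model labels as healthy) and may differ from the metric defined in the paper; the function name and the `"healthy"` label string are likewise hypothetical.

```python
from typing import Sequence


def false_reassurance_rate(
    truth: Sequence[str],
    pred: Sequence[str],
    healthy_label: str = "healthy",
) -> float:
    """Fraction of truly diseased cases predicted as healthy.

    Assumed definition for illustration; not necessarily the paper's metric.
    Returns 0.0 when there are no diseased cases in `truth`.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")

    # Keep only the cases where the ground truth is some disease.
    diseased_preds = [p for t, p in zip(truth, pred) if t != healthy_label]
    if not diseased_preds:
        return 0.0

    # A false reassurance is a diseased case that the model called healthy.
    missed = sum(1 for p in diseased_preds if p == healthy_label)
    return missed / len(diseased_preds)


if __name__ == "__main__":
    # Made-up labels: three diseased cases, one of which is called healthy.
    truth = ["rust", "healthy", "blight", "rust"]
    pred = ["healthy", "healthy", "blight", "rust"]
    print(false_reassurance_rate(truth, pred))
```

Reporting this rate alongside overall accuracy keeps the costliest error visible: a model can score well on accuracy while still missing a meaningful share of diseased plants.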

Topics

PlantInquiryVQA
Multimodal Large Language Models
Chain of Inquiry Framework
Botanical Diagnosis
Visual Question Answering

Code references

syed-nazmus-sakib/PlantInquiryVQA

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.