Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

MM-AQA is a new benchmark for evaluating effective abstention (EA) in multimodal reasoning systems, specifically vision-language models (VLMs) and multi-agent systems (MAS). It creates unanswerable instances by transforming answerable ones along two axes: visual modality dependency and evidence sufficiency. The evaluation covers three frontier VLMs (closed and open-source) and two MAS architectures across 2,079 samples. Under standard prompting, VLMs rarely abstain, and simple confidence baselines outperform that setup. MAS improves abstention but introduces an accuracy-abstention trade-off. Sequential MAS designs perform comparably to or better than iterative variants, which suggests that miscalibration, not reasoning depth, is the primary bottleneck. Models tend to abstain when evidence is entirely absent, but when evidence is degraded or contradictory they attempt to reconcile it instead.
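To make the "simple confidence baseline" and the accuracy-abstention trade-off concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the benchmark's evaluation code: the `Prediction` record, the `decide` and `score` functions, and the two metric names are hypothetical, and the summary does not say how the original work obtains confidence scores or defines its EA metric.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Prediction:
    """Hypothetical record for one model output (not from the source)."""
    answer: str
    confidence: float      # assumed to lie in [0, 1]
    gold: Optional[str]    # None marks an unanswerable instance


def decide(pred: Prediction, threshold: float) -> Optional[str]:
    """Return the answer, or None (abstain) when confidence is below threshold."""
    return pred.answer if pred.confidence >= threshold else None


def score(preds: List[Prediction], threshold: float) -> Dict[str, float]:
    """Accuracy on answerable items and abstention rate on unanswerable ones."""
    answerable = [p for p in preds if p.gold is not None]
    unanswerable = [p for p in preds if p.gold is None]
    correct = sum(decide(p, threshold) == p.gold for p in answerable)
    abstained = sum(decide(p, threshold) is None for p in unanswerable)
    return {
        "answerable_accuracy": correct / len(answerable) if answerable else 0.0,
        "abstention_recall": abstained / len(unanswerable) if unanswerable else 0.0,
    }
```

Raising the threshold makes the system abstain more often on unanswerable items, but it also discards correct answers that happen to carry low confidence, which is the trade-off in miniature.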

Key takeaway

Research scientists developing multimodal AI should prioritize integrating abstention-aware training into their model development pipelines. Advanced prompting or a higher agent count alone will not yield robust abstention, because miscalibration is a significant barrier. Focus on explicit training for recognizing insufficient evidence to improve system reliability and trustworthiness.

Key insights

Effective abstention in multimodal systems requires abstention-aware training, not just better prompting or more agents.

Principles

Method

MM-AQA constructs unanswerable multimodal instances by transforming answerable ones along axes of visual modality dependency and evidence sufficiency to evaluate abstention capabilities.
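The construction idea can be sketched as follows. This is a rough illustration only: the summary does not describe the benchmark's actual transformations, so the `Instance` fields and the two perturbation functions here are hypothetical stand-ins for the "absent" and "contradictory" evidence cases mentioned in the findings.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Instance:
    """Hypothetical multimodal QA item (field names are assumptions)."""
    question: str
    image_ref: Optional[str]   # placeholder for the visual evidence
    caption: str               # textual evidence that accompanies the image
    answer: Optional[str]      # None means the expected behavior is to abstain
    perturbation: str = "none"


def remove_visual_evidence(inst: Instance) -> Instance:
    """Evidence entirely absent: drop the image, so the gold label is abstain."""
    return replace(inst, image_ref=None, answer=None, perturbation="absent")


def contradict_evidence(inst: Instance, conflicting_caption: str) -> Instance:
    """Contradictory evidence: pair the image with text that conflicts with it."""
    return replace(
        inst,
        caption=conflicting_caption,
        answer=None,
        perturbation="contradictory",
    )
```

Because each function returns a modified copy, the original answerable instance stays available, so answerable and unanswerable variants of the same question can be evaluated side by side.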

Best for: Research Scientist, AI Scientist, NLP Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.