Knowing When Not to Answer: Evaluating Abstention in Multimodal Reasoning Systems
Summary
A new benchmark, MM-AQA, evaluates effective abstention (EA) in multimodal reasoning systems, specifically vision-language models (VLMs) and multi-agent systems (MAS). MM-AQA creates unanswerable instances by transforming answerable ones along two axes: visual modality dependency and evidence sufficiency. An evaluation of three frontier VLMs (closed and open-source) and two MAS architectures on 2079 samples produced several findings. Under standard prompting, VLMs rarely abstain, and simple confidence baselines outperform this setup. MAS improves abstention but introduces an accuracy-abstention trade-off. Sequential MAS designs perform comparably to or better than iterative variants, which suggests that miscalibration, not reasoning depth, is the primary bottleneck. Models tend to abstain when evidence is entirely absent, but they attempt to reconcile evidence that is degraded or contradictory rather than abstain.
Key takeaway
Research scientists developing multimodal AI should prioritize integrating abstention-aware training into their model development pipelines. Relying solely on advanced prompting or on adding more agents will not yield robust abstention, because miscalibration is a significant barrier. Focus on explicit training for recognizing evidence insufficiency to improve system reliability and trustworthiness.
Key insights
Effective abstention in multimodal systems requires abstention-aware training, not just better prompting or more agents.
Principles
- VLMs rarely abstain under standard prompting.
- MAS improves abstention but introduces an accuracy-abstention trade-off.
- Miscalibration, not reasoning depth, limits abstention.
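To make the accuracy-abstention trade-off concrete, the sketch below scores a set of predictions on answerable and unanswerable instances separately. The record layout and metric names are assumptions made for illustration; the source does not give the benchmark's exact EA metric.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    # Hypothetical layout: None as gold means the instance is unanswerable,
    # None as prediction means the system abstained.
    gold: Optional[str]
    prediction: Optional[str]


def abstention_report(records: List[Record]) -> dict:
    """Score answerable and unanswerable instances separately."""
    answerable = [r for r in records if r.gold is not None]
    unanswerable = [r for r in records if r.gold is None]

    correct = sum(r.prediction == r.gold for r in answerable)
    over_abstained = sum(r.prediction is None for r in answerable)
    correctly_abstained = sum(r.prediction is None for r in unanswerable)

    return {
        # Share of answerable questions answered correctly.
        "accuracy_on_answerable": correct / len(answerable) if answerable else 0.0,
        # Share of answerable questions the system wrongly declined.
        "over_abstention_rate": over_abstained / len(answerable) if answerable else 0.0,
        # Share of unanswerable questions the system correctly declined.
        "abstention_recall": correctly_abstained / len(unanswerable) if unanswerable else 0.0,
    }


if __name__ == "__main__":
    demo = [
        Record(gold="cat", prediction="cat"),
        Record(gold="dog", prediction=None),
        Record(gold=None, prediction=None),
        Record(gold=None, prediction="bird"),
    ]
    print(abstention_report(demo))
```

A system that abstains more often will usually raise abstention recall while also raising the over-abstention rate, which is the trade-off described above.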
Method
MM-AQA constructs unanswerable multimodal instances by transforming answerable ones along axes of visual modality dependency and evidence sufficiency to evaluate abstention capabilities.
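The source does not describe the individual transformations, so the following is only an illustrative sketch of one plausible case: removing the image from a question that depends on it, so that the evidence is entirely absent. The `Instance` fields and the `drop_visual_evidence` helper are hypothetical and are not taken from MM-AQA.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Instance:
    # Hypothetical schema, not the benchmark's actual format.
    question: str
    image_path: Optional[str]
    answer: Optional[str]  # None marks the instance as unanswerable
    needs_image: bool      # True if the question depends on the visual modality


def drop_visual_evidence(inst: Instance) -> Optional[Instance]:
    """Return an unanswerable copy of inst with the image removed.

    Returns None when the question does not depend on the image, since
    removing the image would not make such a question unanswerable.
    """
    if not inst.needs_image or inst.image_path is None:
        return None
    return replace(inst, image_path=None, answer=None)


if __name__ == "__main__":
    original = Instance("What color is the car?", "car.png", "red", True)
    print(drop_visual_evidence(original))
```

Degraded or contradictory evidence would need other transformations, which this sketch does not attempt.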
In practice
- Implement abstention-aware training for VLMs.
- Use a simple confidence baseline as a first abstention mechanism; it outperforms standard prompting.
- Prefer sequential MAS designs, which perform comparably to or better than iterative ones.
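A confidence baseline of the kind mentioned above can be sketched as a threshold on the model's own answer probability. The source does not specify which confidence signal the baselines use. This sketch assumes per-token log-probabilities are available, and the 0.5 threshold is an arbitrary placeholder that would need tuning on held-out data.

```python
import math
from typing import Optional, Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of the token probabilities of a generated answer."""
    if not token_logprobs:
        return 0.0  # no tokens means no evidence of confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(
    answer: str,
    token_logprobs: Sequence[float],
    threshold: float = 0.5,  # assumed value, not taken from the source
) -> Optional[str]:
    """Return the answer if confidence clears the threshold, else None (abstain)."""
    if mean_token_confidence(token_logprobs) >= threshold:
        return answer
    return None


if __name__ == "__main__":
    confident = [math.log(0.9), math.log(0.8)]
    unsure = [math.log(0.2), math.log(0.3)]
    print(answer_or_abstain("red", confident))
    print(answer_or_abstain("red", unsure))
```

Raising the threshold makes the system abstain more often, so the threshold is the knob that moves a system along the accuracy-abstention trade-off.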
Topics
- Multimodal Reasoning Systems
- Effective Abstention
- Vision-Language Models
- Multi-Agent Systems
- MM-AQA Benchmark
Best for: Research Scientist, AI Scientist, NLP Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.