MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments) is a new human-annotated benchmark designed to evaluate search-augmented AI agents in realistic web environments. It addresses limitations of prior benchmarks by using natural language queries without explicit modality cues, incorporating underexplored modalities like video and audio, and requiring reasoning over noisy, conflicting multimodal web evidence. The benchmark evaluates diverse agents, including closed-source models like GPT-5.4-mini and Gemini 3/3.1 Flash/Pro, and open-weight models like Qwen3-4B/30B/235B, across three search settings: no search, native search, and agentic search. Results indicate MERRIN is highly challenging, with an average accuracy of 22.3% across all agents and the best-performing agent achieving only 40.1%. Analysis reveals that reasoning, not just search effectiveness, is a critical bottleneck, and agents often over-explore or exhibit a strong bias towards text modalities.
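The headline figures above (an average accuracy over all agents and a best-agent accuracy) are simple aggregates over per-question outcomes. The sketch below shows one way such numbers could be computed. The records, agent names, and the choice to average over (agent, setting) runs are illustrative assumptions, not MERRIN data or the authors' evaluation code.

```python
from collections import defaultdict

# Hypothetical per-question records: (agent, search_setting, is_correct).
# Agent names and outcomes are placeholders, not results from MERRIN.
records = [
    ("agent_a", "no_search", False),
    ("agent_a", "native_search", True),
    ("agent_a", "native_search", False),
    ("agent_a", "agentic_search", True),
    ("agent_a", "agentic_search", False),
    ("agent_b", "no_search", False),
    ("agent_b", "native_search", False),
    ("agent_b", "agentic_search", True),
    ("agent_b", "agentic_search", True),
]


def accuracy_by_run(rows):
    """Return {(agent, setting): accuracy} for each evaluated run."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for agent, setting, is_correct in rows:
        totals[(agent, setting)] += 1
        correct[(agent, setting)] += int(is_correct)
    return {key: correct[key] / totals[key] for key in totals}


def summarize(run_accuracy):
    """Return (mean accuracy over runs, best run key, best accuracy)."""
    mean = sum(run_accuracy.values()) / len(run_accuracy)
    best = max(run_accuracy, key=run_accuracy.get)
    return mean, best, run_accuracy[best]


per_run = accuracy_by_run(records)
mean_acc, best_run, best_acc = summarize(per_run)
print(f"mean accuracy over runs: {mean_acc:.3f}")
print(f"best run: {best_run} at {best_acc:.3f}")
```

Averaging over runs weights every (agent, setting) pair equally. Averaging over all questions instead would weight runs by size, so the two can differ when runs cover different numbers of questions.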

Key takeaway

For research scientists developing search-augmented AI agents, MERRIN highlights that current models struggle significantly with multimodal reasoning and efficient evidence selection in noisy web environments. You should prioritize improving reasoning capabilities and developing agents that can effectively integrate diverse modalities beyond text, rather than merely increasing the number of search queries or visited pages. Consider designing systems that deepen their search productively, as humans do, instead of over-exploring with diminishing returns.

Key insights

MERRIN challenges AI agents in multimodal reasoning over noisy web data, revealing significant performance gaps compared to humans.

Principles

Method

MERRIN's data collection involves human annotation, multi-round review, and a two-pass verification protocol to ensure that each question genuinely requires non-text evidence and cannot be answered through text-only shortcuts.
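The summary does not spell out what the two verification passes check. One plausible reading, sketched below purely as an assumption, is that a question is retained only if it is answerable with the full multimodal evidence and not answerable from text alone. The `Candidate` type, its field names, and the `keep` rule are hypothetical and are not taken from the benchmark's tooling.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    # Hypothetical verdict: annotators could answer using all evidence.
    answerable_with_all_evidence: bool
    # Hypothetical verdict: annotators could answer using text alone.
    answerable_with_text_only: bool


def keep(candidate: Candidate) -> bool:
    """Retain a question only if it needs non-text evidence (assumed rule)."""
    return (
        candidate.answerable_with_all_evidence
        and not candidate.answerable_with_text_only
    )


candidates = [
    Candidate("needs the video", True, False),
    Candidate("text shortcut exists", True, True),
    Candidate("unanswerable", False, False),
]

retained = [c.question for c in candidates if keep(c)]
print(retained)
```

Under this assumed rule, a question with a text-only shortcut is dropped even though it is answerable, which is the property the protocol is described as enforcing.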

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.