MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
Summary
MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments) is a new human-annotated benchmark designed to evaluate search-augmented AI agents in realistic web environments. It addresses limitations of prior benchmarks by using natural language queries without explicit modality cues, incorporating underexplored modalities like video and audio, and requiring reasoning over noisy, conflicting multimodal web evidence. The benchmark evaluates diverse agents, including closed-source models like GPT-5.4-mini and Gemini 3/3.1 Flash/Pro, and open-weight models like Qwen3-4B/30B/235B, across three search settings: no search, native search, and agentic search. Results indicate MERRIN is highly challenging, with an average accuracy of 22.3% across all agents and the best-performing agent achieving only 40.1%. Analysis reveals that reasoning, not just search effectiveness, is a critical bottleneck, and agents often over-explore or exhibit a strong bias towards text modalities.
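As a rough illustration of how accuracy across agents and search settings might be tallied, the sketch below groups per-question outcomes by (agent, setting) and then takes an unweighted mean over those cells. The record layout, the function names, and the averaging scheme are all assumptions made for illustration; the summary does not say how the reported average was computed, and the toy records are not MERRIN results.

```python
from collections import defaultdict


def accuracy_by_setting(records):
    """Return {(agent, setting): accuracy} from (agent, setting, is_correct) tuples.

    The tuple layout is a hypothetical format chosen for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # (agent, setting) -> [correct, total]
    for agent, setting, is_correct in records:
        cell = totals[(agent, setting)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


def macro_average(per_cell):
    """Unweighted mean over (agent, setting) cells (an assumed averaging scheme)."""
    return sum(per_cell.values()) / len(per_cell)


# Toy records for illustration only; these are NOT benchmark results.
records = [
    ("agent_a", "no_search", False),
    ("agent_a", "no_search", False),
    ("agent_a", "agentic_search", True),
    ("agent_a", "agentic_search", False),
    ("agent_b", "native_search", True),
    ("agent_b", "native_search", True),
]

per_cell = accuracy_by_setting(records)
overall = macro_average(per_cell)
```

Keeping the per-cell accuracies separate from the overall mean makes it easy to compare the no-search, native-search, and agentic-search settings for the same agent.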
Key takeaway
For research scientists developing search-augmented AI agents, MERRIN shows that current models struggle significantly with multimodal reasoning and efficient evidence selection in noisy web environments. You should prioritize improving reasoning capabilities and building agents that integrate modalities beyond text, rather than merely issuing more search queries or visiting more pages. Consider designing systems that deepen their search productively, as humans do, instead of over-exploring with diminishing returns.
Key insights
MERRIN tests AI agents on multimodal reasoning over noisy web data and reveals significant performance gaps relative to humans.
Principles
- Natural language queries should lack explicit modality cues.
- Multimodal reasoning requires diverse evidence, including video and audio.
- Web search environments are inherently noisy and often contain conflicting evidence.
Method
MERRIN's data collection involves human annotation, multi-round review, and a two-pass verification protocol to ensure that each question genuinely requires non-text evidence and cannot be answered through text-only shortcuts.
In practice
- Augment native search with video processing tools for improved accuracy.
- Focus agent development on robust multimodal reasoning capabilities.
- Design agents to avoid over-exploration in noisy web environments.
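One way to act on the over-exploration point is a simple stopping rule: stop visiting pages once several consecutive pages add little new evidence. The sketch below is a minimal illustration under stated assumptions, not a method from MERRIN. The function `explore_with_budget`, the `score_gain` callback, and the thresholds `max_pages`, `min_gain`, and `patience` are all hypothetical.

```python
def explore_with_budget(pages, score_gain, max_pages=10, min_gain=0.05, patience=2):
    """Visit pages in order and stop early when returns diminish.

    pages      -- candidate pages in ranked order (hypothetical input)
    score_gain -- callable returning the estimated evidence gain of a page
                  (hypothetical; a real agent would need its own estimator)
    Stops after `patience` consecutive pages whose gain is below `min_gain`,
    or after `max_pages` pages, whichever comes first.
    """
    visited = []
    low_streak = 0
    for page in pages[:max_pages]:
        visited.append(page)
        gain = score_gain(page)
        # Count consecutive low-value pages; reset on any useful page.
        low_streak = low_streak + 1 if gain < min_gain else 0
        if low_streak >= patience:
            break
    return visited


# Toy gains for illustration only.
toy_gains = {"p1": 0.4, "p2": 0.3, "p3": 0.01, "p4": 0.0, "p5": 0.5}
visited = explore_with_budget(list(toy_gains), toy_gains.get)
```

In this toy run the agent stops after `p4` and never reaches `p5`, which shows the trade-off in such a rule: a small `patience` saves effort but can miss useful evidence that sits further down the list.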
Topics
- MERRIN Benchmark
- Multimodal Evidence Retrieval
- Multi-hop Reasoning
- Search-Augmented Agents
- Noisy Web Environments
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.