MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments
Summary
MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments) is a human-annotated benchmark for evaluating search-augmented AI agents in realistic web environments. It targets two difficulties of real-world search: queries are underspecified and multi-hop, and web results are multimodal, heterogeneous, and often conflicting. MERRIN differs from earlier benchmarks in three ways: its natural language queries carry no explicit modality cues, it includes underexplored modalities such as video and audio, and it requires retrieving complex, noisy multimodal evidence. The benchmark is evaluated on ten diverse models, including GPT-5.4-mini, Gemini 3/3.1 Flash/Pro, and Qwen3-4B/30B/235B, under three settings: no search, native search, and agentic search. The results show that MERRIN is highly challenging: average accuracy across all agents is 22.3%, and the top performer reaches only 40.1%. Stronger agents such as Gemini Deep Research show modest gains but over-explore and are distracted by irrelevant content. Compared with humans, they consume more resources yet reach lower accuracy, largely because of inefficient source selection and overreliance on text.
Key takeaway
For research scientists developing search-augmented AI agents, MERRIN highlights critical gaps in current capabilities, particularly in multimodal reasoning and efficient evidence retrieval from noisy web sources. You should focus on improving agents' ability to implicitly identify relevant modalities, avoid over-exploration, and enhance source selection beyond text-centric approaches to achieve human-comparable accuracy and resource efficiency.
Key insights
MERRIN challenges AI agents to perform multimodal reasoning and evidence retrieval in noisy web environments.
Principles
- Natural language queries require implicit modality identification.
- Over-exploration can degrade agent performance.
- Multimodal reasoning needs robust source selection.
Method
MERRIN evaluates search-augmented agents by requiring them to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources using natural language queries.
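As a rough illustration of how accuracy could be compared across the three search settings mentioned in the summary, here is a minimal sketch. It is not MERRIN's evaluation code. The names Example, normalize, and accuracy_by_setting, the agent callable signature, and the setting labels are all hypothetical, and normalized exact match is used only as a stand-in because the summary does not say how answers are scored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical labels for the three settings named in the summary.
SETTINGS = ("no_search", "native_search", "agentic_search")


@dataclass
class Example:
    """One benchmark item: a natural language query and its reference answer."""
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy_by_setting(
    examples: List[Example],
    agent: Callable[[str, str], str],
) -> Dict[str, float]:
    """Run the agent on every example in every setting and return accuracy per setting.

    `agent(question, setting)` is an assumed interface that returns the agent's answer.
    Scoring here is normalized exact match, which is an assumption of this sketch.
    """
    results: Dict[str, float] = {}
    for setting in SETTINGS:
        correct = sum(
            normalize(agent(ex.question, setting)) == normalize(ex.answer)
            for ex in examples
        )
        results[setting] = correct / len(examples) if examples else 0.0
    return results
```

Passing any callable that maps a question and a setting label to an answer string yields one accuracy figure per setting, which makes gaps between no search, native search, and agentic search easy to compare side by side.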
In practice
- Integrate video and audio modalities into search agents.
- Develop agents for multi-hop reasoning tasks.
- Prioritize efficient source selection in agent design.
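One way to act on the points above is to cap how many sources an agent opens and to keep any single modality from crowding out the rest. The sketch below is a hypothetical heuristic, not a method from the MERRIN paper. The Source type, its relevance score, and the budget and per_modality_cap parameters are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Source:
    """A candidate web source; `relevance` is an assumed score from some upstream ranker."""
    url: str
    modality: str  # e.g. "text", "image", "video", "audio"
    relevance: float


def select_sources(
    candidates: List[Source],
    budget: int,
    per_modality_cap: int,
) -> List[Source]:
    """Greedily pick the most relevant sources under two limits.

    `budget` bounds the total number of sources opened (limits over-exploration).
    `per_modality_cap` bounds how many sources of one modality are picked
    (limits overreliance on text).
    """
    chosen: List[Source] = []
    counts: Dict[str, int] = {}
    # sorted() is stable, so ties keep their original candidate order.
    for src in sorted(candidates, key=lambda s: s.relevance, reverse=True):
        if len(chosen) >= budget:
            break
        if counts.get(src.modality, 0) >= per_modality_cap:
            continue  # this modality is already at its cap; try the next candidate
        chosen.append(src)
        counts[src.modality] = counts.get(src.modality, 0) + 1
    return chosen
```

With a budget of 3 and a cap of 2 per modality, three high-scoring text pages and one lower-scoring video would yield two text pages plus the video, so the agent still inspects non-text evidence without widening its search.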
Topics
- MERRIN Benchmark
- Multimodal Evidence Retrieval
- Multi-hop Reasoning
- Search-Augmented Agents
- Noisy Web Environments
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.