MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments) is a new human-annotated benchmark for evaluating search-augmented AI agents in complex, real-world web environments. It targets two difficulties: search queries that are underspecified and multi-hop, and web results that are multimodal, heterogeneous, and often conflicting. MERRIN distinguishes itself in three ways: its natural-language queries carry no explicit modality cues, it incorporates underexplored modalities such as video and audio, and it requires retrieving complex, noisy multimodal evidence. Evaluations of ten diverse models, including GPT-5.4-mini, Gemini 3/3.1 Flash/Pro, and Qwen3-4B/30B/235B, across no-search, native-search, and agentic-search settings show that MERRIN is highly challenging: average accuracy across all agents is 22.3%, and the top performer reaches only 40.1%. Stronger agents such as Gemini Deep Research make modest gains but over-explore and are distracted by irrelevant content. They consume more resources than humans yet achieve lower accuracy, because of inefficient source selection and overreliance on text.

Key takeaway

For research scientists developing search-augmented AI agents, MERRIN highlights critical gaps in current capabilities, particularly in multimodal reasoning and efficient evidence retrieval from noisy web sources. You should focus on improving agents' ability to implicitly identify relevant modalities, avoid over-exploration, and enhance source selection beyond text-centric approaches to achieve human-comparable accuracy and resource efficiency.

Key insights

MERRIN challenges AI agents to perform multimodal reasoning and evidence retrieval in noisy web environments.

Method

MERRIN evaluates search-augmented agents by requiring them to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources using natural language queries.
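
As a rough illustration of how per-setting accuracy figures like those in the summary could be computed, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Example` type, the `evaluate` function, the setting labels, and the exact-match scoring rule are hypothetical and are not taken from MERRIN's actual harness, which may score answers differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical labels for the three settings named in the summary.
SETTINGS = ("no_search", "native_search", "agentic_search")


@dataclass(frozen=True)
class Example:
    question: str  # natural-language query with no explicit modality cue
    answer: str    # gold short answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def evaluate(
    agent: Callable[[str, str], str],
    examples: Sequence[Example],
    settings: Sequence[str] = SETTINGS,
) -> Dict[str, float]:
    """Return accuracy per setting, using exact match (an assumed metric)."""
    if not examples:
        raise ValueError("need at least one example")
    scores = {}
    for setting in settings:
        correct = sum(
            normalize(agent(ex.question, setting)) == normalize(ex.answer)
            for ex in examples
        )
        scores[setting] = correct / len(examples)
    return scores


# Toy data and a toy agent, both invented for this sketch.
examples = [
    Example("Which instrument opens the theme in the clip?", "Cello"),
    Example("In what year was the interview recorded?", "1998"),
]


def toy_agent(question: str, setting: str) -> str:
    # Pretend the agent only finds the audio evidence when it can browse.
    if setting == "agentic_search" and "instrument" in question:
        return "cello"
    return "unknown"


scores = evaluate(toy_agent, examples)
print(scores)
```

Keeping the setting as an explicit argument lets the same agent be scored with and without search, which is the comparison the summary reports across its three settings.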

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.