Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation

2026-05-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

The paper "Search-Time Contamination in Deep Research Agents" identifies a critical vulnerability where large language models (LLMs) equipped with web search capabilities, termed deep research agents, inadvertently retrieve public benchmark artifacts during inference. This phenomenon, Search-Time Contamination (STC), inflates measured performance by allowing agents to bypass intended reasoning. Researchers from Alibaba-NTU and Alibaba Group define three contamination types: Benchmark Metadata Leakage (BML), Question-Context Leakage (QCL), and Explicit Answer Leakage (EAL). Evaluating agents like Tongyi Deep Research and Gemini Deep Research on six medical benchmarks (e.g., MedQA, HLE), they found STC is widespread, inflating performance by up to 4% on HLE biological and chemical subsets. MedMCQA showed nearly 25% of questions with retrievable answers, and Gemini Deep Research had a 60% leakage rate on MedQA. The study advocates for contamination-aware practices, including isolated sandboxes and transparent search trajectories.

Key takeaway

If you are an AI scientist or machine learning engineer evaluating deep research agents, adopt contamination-aware practices so that reported metrics reflect reasoning rather than retrieval of leaked answers. Run evaluations inside isolated knowledge sandboxes that prevent agents from reaching public benchmark artifacts. Ask commercial systems to expose their search trajectories so that runs can be audited for Search-Time Contamination, since answers leaked through web search can inflate reported reasoning capabilities. Gate access to benchmarks to limit test set exposure.

Key insights

Deep research agents' web search can contaminate benchmark evaluations, inflating performance by bypassing genuine reasoning.

Principles

STC inflates measured agent performance by up to 4% (reported on the HLE biological and chemical subsets).
Explicit Answer Leakage (EAL) is the most severe STC type.
Benchmark recency does not eliminate STC risk.

Method

The study defines three STC types (BML, QCL, EAL) and develops detection algorithms: regex URL matching for BML, lexical overlap for QCL, and LLM-as-a-Judge (DeepSeek V4 Pro) for EAL.
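The two rule-based detectors named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the URL patterns, the trigram size, the 0.8 threshold, and the function names (is_bml, question_overlap, is_qcl) are all assumptions chosen for illustration. The EAL check is omitted because it depends on an LLM judge call.

```python
import re

# Hypothetical patterns for pages that host benchmark artifacts.
# The paper's actual regexes are not given in this summary.
BENCHMARK_URL_PATTERNS = [
    re.compile(r"huggingface\.co/datasets/", re.IGNORECASE),
    re.compile(r"github\.com/[^/]+/[^/]*(medqa|medmcqa)", re.IGNORECASE),
]


def is_bml(url: str) -> bool:
    """Flag a retrieved URL as possible Benchmark Metadata Leakage."""
    return any(p.search(url) for p in BENCHMARK_URL_PATTERNS)


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list, n: int) -> set:
    """Return the set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def question_overlap(question: str, page_text: str, n: int = 3) -> float:
    """Fraction of the question's n-grams that also occur in the page."""
    q = ngrams(tokenize(question), n)
    if not q:
        return 0.0
    p = ngrams(tokenize(page_text), n)
    return len(q & p) / len(q)


def is_qcl(question: str, page_text: str,
           threshold: float = 0.8, n: int = 3) -> bool:
    """Flag possible Question-Context Leakage (threshold is an assumption)."""
    return question_overlap(question, page_text, n) >= threshold
```

A page that reproduces the question verbatim scores 1.0 and is flagged, while an unrelated page scores 0.0. In practice the threshold would need tuning against manually labelled search trajectories.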

In practice

Conduct evaluations in isolated knowledge sandboxes.
Implement transparent search trajectories for auditing.
Control benchmark access with gated systems.
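As a rough illustration of the first two practices, the sketch below filters search results against a domain blocklist and records every retrieved URL for later audit. A blocklist is a much weaker stand-in for a truly isolated knowledge sandbox, and the names used here (BLOCKED_DOMAINS, TrajectoryLog, filter_results) and the listed domains are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Hypothetical blocklist of hosts likely to serve benchmark artifacts.
BLOCKED_DOMAINS = {"huggingface.co", "github.com"}


@dataclass
class TrajectoryLog:
    """Append-only record of every URL the agent's search returned."""
    events: list = field(default_factory=list)

    def record(self, query: str, url: str, allowed: bool) -> None:
        self.events.append({"query": query, "url": url, "allowed": allowed})

    def blocked_rate(self) -> float:
        """Share of logged results that were withheld from the agent."""
        if not self.events:
            return 0.0
        blocked = sum(not e["allowed"] for e in self.events)
        return blocked / len(self.events)


def domain_blocked(url: str, blocked=BLOCKED_DOMAINS) -> bool:
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_results(query: str, urls: list, log: TrajectoryLog) -> list:
    """Log every result, and return only those the agent may read."""
    kept = []
    for url in urls:
        allowed = not domain_blocked(url)
        log.record(query, url, allowed)
        if allowed:
            kept.append(url)
    return kept
```

Logging blocked results as well as allowed ones keeps the trajectory complete, so an auditor can see how often an agent's queries surfaced benchmark-hosting pages even when the content was withheld.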

Topics

Deep Research Agents
Search-Time Contamination
LLM Evaluation
Benchmark Integrity
Medical QA Benchmarks
Knowledge Sandboxes

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.