When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

CREST-Search is a novel red-teaming framework designed to systematically uncover safety vulnerabilities, particularly "citation risks," in Large Language Models (LLMs) integrated with web search functionality. Traditional red-teaming methods, which focus on standalone LLMs, are insufficient for these systems due to the complex interplay of reasoning, generation, and external tool use. CREST-Search employs three attack strategies—keyword injection, exaggeration, and role play—to generate adversarial search queries. It leverages in-context learning for iterative refinement of these queries and introduces WebSearch-Harm, a search-specific harmful dataset used to fine-tune a specialized red-teaming model. Experiments on four commercial LLMs (GPT-4o-search-preview, GPT-4o-mini-search-preview, Gemini-2.0-flash-search, Gemini-2.5-flash-search) demonstrate that CREST-Search achieves an 80.5% risk detection rate, significantly outperforming baselines, with 89.3% of detected risks being citation-related. The framework also maintains low query toxicity (23.6%) and high diversity (0.82 self-BLEU score), proving its effectiveness and efficiency in black-box settings.

Key takeaway

For CTOs and VPs of Engineering deploying LLMs with web search capabilities, recognizing and mitigating "citation risks" is paramount. Your existing safety filters for standalone LLMs are likely inadequate, as CREST-Search demonstrates that 89.3% of vulnerabilities in search-enabled LLMs stem from harmful external citations, even when responses appear benign. Prioritize implementing robust, real-time safety detection for both retrieved URLs and their content, and consider fine-tuning your LLMs with specialized adversarial datasets like WebSearch-Harm to build stronger, more comprehensive defense mechanisms against these subtle yet critical threats.

Key insights

CREST-Search effectively uncovers unique citation risks in web-search-augmented LLMs through specialized red-teaming strategies and a fine-tuned model.

Principles

Method

CREST-Search uses a three-stage pipeline: adversarial query generation (via a fine-tuned model and strategies), web search execution/risk evaluation (using moderation tools), and iterative refinement with judgment feedback.

In practice

Topics

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.