When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models
Summary
CREST-Search is a novel red-teaming framework designed to systematically uncover safety vulnerabilities, particularly "citation risks," in Large Language Models (LLMs) integrated with web search functionality. Traditional red-teaming methods, which focus on standalone LLMs, are insufficient for these systems due to the complex interplay of reasoning, generation, and external tool use. CREST-Search employs three attack strategies—keyword injection, exaggeration, and role play—to generate adversarial search queries. It leverages in-context learning for iterative refinement of these queries and introduces WebSearch-Harm, a search-specific harmful dataset used to fine-tune a specialized red-teaming model. Experiments on four commercial LLMs (GPT-4o-search-preview, GPT-4o-mini-search-preview, Gemini-2.0-flash-search, Gemini-2.5-flash-search) demonstrate that CREST-Search achieves an 80.5% risk detection rate, significantly outperforming baselines, with 89.3% of detected risks being citation-related. The framework also maintains low query toxicity (23.6%) and high diversity (0.82 self-BLEU score), proving its effectiveness and efficiency in black-box settings.
Key takeaway
For CTOs and VPs of Engineering deploying LLMs with web search capabilities, recognizing and mitigating "citation risks" is paramount. Your existing safety filters for standalone LLMs are likely inadequate, as CREST-Search demonstrates that 89.3% of vulnerabilities in search-enabled LLMs stem from harmful external citations, even when responses appear benign. Prioritize implementing robust, real-time safety detection for both retrieved URLs and their content, and consider fine-tuning your LLMs with specialized adversarial datasets like WebSearch-Harm to build stronger, more comprehensive defense mechanisms against these subtle yet critical threats.
Key insights
CREST-Search effectively uncovers unique citation risks in web-search-augmented LLMs through specialized red-teaming strategies and a fine-tuned model.
Principles
- Web search integration amplifies LLM safety risks.
- Citation risks are distinct from response risks.
- Black-box testing requires input-output probing.
Method
CREST-Search uses a three-stage pipeline: adversarial query generation (via a fine-tuned model and strategies), web search execution/risk evaluation (using moderation tools), and iterative refinement with judgment feedback.
In practice
- Implement safety detection on both URLs and content.
- Fine-tune LLMs with adversarial queries for robustness.
- Integrate automated red-teaming into deployment pipelines.
Topics
- Web-Augmented LLMs
- Red-Teaming Frameworks
- Citation Risk
- Adversarial Search Queries
- CREST-Search
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.