Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Intermediate, long

Summary

The study evaluates five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection attacks. These attacks phrase malicious instructions in domain-appropriate vocabulary, which lets them bypass standard syntactic detectors. The evaluation spans 3,510 trials across three models (Claude Haiku, Llama 3.1 8B, and Gemini 2.0 Flash) and three deployment domains (financial, legal, and general). Paraphrasing retrieved content proved most effective: it reduced camouflage attack success rates by 55–84% and outperformed Llama Guard 4. Defense effectiveness varied significantly by model; for instance, spotlighting halved attack success on Claude Haiku but offered no benefit on Llama 3.1 8B. Financial domain deployments exhibited the highest residual risk, with 26–33% baseline attack success, and no single prompting defense fully eliminated the threat on weaker models. This research provides the first systematic evaluation of these defenses against camouflage-class injection attacks.
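To make the three base defenses concrete, here is a minimal sketch of how each one might reshape the prompt around retrieved content. It is an illustration under stated assumptions, not the study's implementation: the function names, the prompt wording, the datamarking variant of spotlighting, and the `paraphraser` callable (a stand-in for a separate model call that rewrites the text) are all hypothetical.

```python
from typing import Callable


def spotlight(retrieved: str, marker: str = "^") -> str:
    """Datamark retrieved text (one common spotlighting variant, assumed here).

    The marker replaces whitespace between words; the system prompt would
    separately tell the model that marked text is data, never instructions.
    """
    return marker.join(retrieved.split())


def sandwich(task: str, retrieved: str) -> str:
    """Prompt sandwiching: restate the task after the untrusted content."""
    return (
        f"{task}\n\n"
        f"Retrieved content:\n{retrieved}\n\n"
        f"Reminder: your task is: {task}"
    )


def paraphrase_then_prompt(
    task: str, retrieved: str, paraphraser: Callable[[str], str]
) -> str:
    """Paraphrasing: rewrite retrieved text before the agent model sees it.

    `paraphraser` is a hypothetical hook; in practice it would call a model
    that restates the content, which tends to disrupt embedded instructions.
    """
    rewritten = paraphraser(retrieved)
    return f"{task}\n\nRetrieved content (paraphrased):\n{rewritten}"
```

Passing the paraphraser in as a callable keeps the sketch independent of any particular model API, so it can be swapped for whichever rewriting step a deployment uses.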

Key takeaway

MLOps engineers deploying LLM agents in enterprise settings, especially with RAG systems, should prioritize paraphrasing as the primary defense against domain-camouflaged injection attacks. In the study, it reduced attack success rates more than Llama Guard 4 and the other prompting techniques. Always validate defense efficacy on your specific model and domain, however, because effectiveness is highly model-dependent. Financial sector deployments in particular require additional architectural controls due to persistent residual risk.
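One way to act on the "validate on your own model and domain" advice is to compare attack success with and without the defense for each deployment and flag any that remain too risky. The sketch below assumes the reported reductions are relative to the undefended baseline, which the summary does not state explicitly; the function names, the threshold, and the numbers in the usage lines are hypothetical and are not results from the study.

```python
from typing import Dict, List, Tuple


def attack_success_rate(outcomes: List[bool]) -> float:
    """Fraction of trials in which the injected instruction was followed."""
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    return sum(outcomes) / len(outcomes)


def relative_reduction(baseline_asr: float, defended_asr: float) -> float:
    """Share of baseline attack success removed by the defense."""
    if baseline_asr <= 0:
        raise ValueError("baseline attack success rate must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


def flag_residual_risk(
    results: Dict[Tuple[str, str], Tuple[float, float]],
    max_residual: float,
) -> List[Tuple[str, str]]:
    """Return (model, domain) pairs whose defended ASR exceeds the threshold.

    `results` maps (model, domain) to (baseline_asr, defended_asr).
    """
    return [
        cell
        for cell, (_baseline, defended) in results.items()
        if defended > max_residual
    ]


# Made-up numbers purely for illustration.
example = {
    ("model-a", "financial"): (0.30, 0.12),
    ("model-a", "legal"): (0.10, 0.02),
}
needs_more_controls = flag_residual_risk(example, max_residual=0.05)
```

A flagged pair is a signal that prompting defenses alone may not be enough for that deployment and that architectural controls are worth adding.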

Key insights

Paraphrasing retrieved content is the most effective prompting defense against domain-camouflaged injection attacks.

Method

The study systematically evaluated five prompting-based defenses (spotlighting, paraphrasing, prompt sandwiching, and two combinations) against domain-camouflaged injection attacks, using 3,510 trials across three models and three deployment domains.
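An evaluation of this shape can be sketched as a loop over every model, domain, and defense combination that tallies how often the attack succeeds. This is a rough outline only. The summary does not say which combinations were tested or how trials were split per condition, so `combo_a`, `combo_b`, the `none` baseline label, and the `run_trial` callable are hypothetical placeholders.

```python
from itertools import product
from typing import Callable, Dict, Tuple

MODELS = ["Claude Haiku", "Llama 3.1 8B", "Gemini 2.0 Flash"]
DOMAINS = ["financial", "legal", "general"]
# "none" is an assumed undefended baseline; the combo names are placeholders.
DEFENSES = ["none", "spotlighting", "paraphrasing", "sandwiching",
            "combo_a", "combo_b"]

Cell = Tuple[str, str, str]


def run_grid(
    run_trial: Callable[[str, str, str, int], bool],
    trials_per_cell: int,
) -> Dict[Cell, float]:
    """Return the attack success rate for every (model, domain, defense) cell.

    `run_trial(model, domain, defense, trial_index)` is a hypothetical hook
    that executes one attack attempt and returns True if the attack succeeded.
    """
    if trials_per_cell <= 0:
        raise ValueError("trials_per_cell must be positive")
    rates: Dict[Cell, float] = {}
    for model, domain, defense in product(MODELS, DOMAINS, DEFENSES):
        hits = sum(
            run_trial(model, domain, defense, i)
            for i in range(trials_per_cell)
        )
        rates[(model, domain, defense)] = hits / trials_per_cell
    return rates
```

Keeping the undefended baseline as its own condition in the same grid lets each defense be compared against it on the same model and domain.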

Best for: AI Architect, Machine Learning Engineer, NLP Engineer, AI Security Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.