Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Advanced, quick

Summary

A study investigated the effectiveness and brittleness of various defense mechanisms against OWASP-LLM-Top-10 threats in production LLM applications. Researchers used a 21-agent scanner against four synthetic LLM endpoints: $L_0$ (no defenses), $L_1$ (refusal-phrase filters), $L_2$ (token-budget controls), and $L_3$ (full stack including tool-registry authentication and credential scrubbing). Findings from $N=10$ replications show refusal filters alone eliminated LLM01 (jailbreak) and LLM07 (system-prompt leakage), while budget controls alone removed LLM02 (sensitive-info disclosure) and LLM10 (unbounded consumption). LLM06 (excessive agency) required the full $L_3$ stack. The study also probed brittleness under paraphrasing, revealing that 300 Gemini-generated paraphrases reduced $L_1$'s refusal block rate by 15 pp on LLM01 and 25 pp on LLM07, whereas budget controls showed no drop (0 pp). A fifth target, $L_4$-real, using Gemini-2.5-flash with the $L_3$ regex, matched $L_1$ exactly, indicating no measurable alignment contribution beyond the regex for the tested threats.

Key takeaway

For AI Security Engineers designing LLM application defenses, you should prioritize budget controls for their resilience against evolving attack patterns, especially paraphrased inputs. While refusal-phrase filters are effective against specific threats like jailbreaks, their brittleness under LLM-driven paraphrasing means you cannot rely on them alone. Implement a multi-layered defense, ensuring your full stack addresses complex threats like excessive agency, and regularly test your refusal mechanisms with adversarial LLM-generated variations.

Key insights

Refusal filters are vulnerable to paraphrasing, while budget controls maintain effectiveness against LLM-driven attack mutations.

Principles

Method

The study used a 21-agent scanner against LLM endpoints with varying defense stacks ($L_0$ to $L_3$), measuring OWASP-LLM-Top-10 coverage and brittleness under $K=5$ Gemini-generated paraphrases from a 60-template corpus.

In practice

Topics

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.