AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents
Summary
AutoDojo is an adaptive attack framework designed to expose the limitations of current defenses against Indirect Prompt Injection (IPI) in LLM-powered agents. It extends the static AgentDojo benchmark by iteratively optimizing IPI attacks against a given defense using a frontier LLM. The research demonstrates that many state-of-the-art defenses offer only limited protection; adaptive attacks significantly raise the Attack Success Rate (ASR) compared to static injections. For instance, AutoDojo recovered 28% overall ASR and 64% on action-open tasks against a filter that reduced static ASR to 0%. The study also highlights a structural vulnerability in "action-open" tasks, where user requests delegate actions to attacker-controlled content, allowing injections to bypass defenses by posing as ordinary data.
Key takeaway
For AI Security Engineers evaluating LLM agent defenses, AutoDojo demonstrates that static benchmarks provide a false sense of security. You should assume prompt-based and detection-based defenses are vulnerable to adaptive indirect prompt injection, especially for action-open tasks. Prioritize developing system-level defenses or re-architecting agent tasks to minimize user underspecification to truly mitigate this threat, as current approaches are easily bypassed by optimized attacks.
Key insights
AutoDojo reveals LLM agent defenses are superficial against adaptive IPI, especially on underspecified tasks.
Principles
- Static benchmarks fail to assess adaptive threat robustness.
- Adaptive attacks significantly bypass current IPI defenses.
- Action-open tasks are structurally vulnerable to IPI.
Method
AutoDojo optimizes Indirect Prompt Injection (IPI) against a given defense by using a frontier LLM to iteratively refine the attack, extending the AgentDojo framework.
In practice
- Evaluate LLM agent defenses with adaptive attack frameworks.
- Prioritize defense for action-open LLM agent tasks.
- Rethink prompt-based and detection-based IPI defenses.
Topics
- Indirect Prompt Injection
- LLM Agents
- Adaptive Attacks
- AI Security
- Security Benchmarking
- AutoDojo
Code references
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Machine Learning Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.