AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents

2026-06-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

AutoDojo is an adaptive attack framework designed to expose the limitations of current defenses against Indirect Prompt Injection (IPI) in LLM-powered agents. It extends the static AgentDojo benchmark by iteratively optimizing IPI attacks against a given defense using a frontier LLM. The research demonstrates that many state-of-the-art defenses offer only limited protection; adaptive attacks significantly raise the Attack Success Rate (ASR) compared to static injections. For instance, AutoDojo recovered 28% overall ASR and 64% on action-open tasks against a filter that reduced static ASR to 0%. The study also highlights a structural vulnerability in "action-open" tasks, where user requests delegate actions to attacker-controlled content, allowing injections to bypass defenses by posing as ordinary data.

Key takeaway

For AI Security Engineers evaluating LLM agent defenses, AutoDojo demonstrates that static benchmarks provide a false sense of security. You should assume prompt-based and detection-based defenses are vulnerable to adaptive indirect prompt injection, especially for action-open tasks. Prioritize developing system-level defenses or re-architecting agent tasks to minimize user underspecification to truly mitigate this threat, as current approaches are easily bypassed by optimized attacks.

Key insights

AutoDojo reveals LLM agent defenses are superficial against adaptive IPI, especially on underspecified tasks.

Principles

Static benchmarks fail to assess adaptive threat robustness.
Adaptive attacks significantly bypass current IPI defenses.
Action-open tasks are structurally vulnerable to IPI.

Method

AutoDojo optimizes Indirect Prompt Injection (IPI) against a given defense by using a frontier LLM to iteratively refine the attack, extending the AgentDojo framework.

In practice

Evaluate LLM agent defenses with adaptive attack frameworks.
Prioritize defense for action-open LLM agent tasks.
Rethink prompt-based and detection-based IPI defenses.

Topics

Indirect Prompt Injection
LLM Agents
Adaptive Attacks
AI Security
Security Benchmarking
AutoDojo

Code references

xhOwenMa/AutoDojo

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Security Engineer, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.