AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification
Summary
AgentSentry is a new inference-time defense framework designed to mitigate indirect prompt injection (IPI) in large language model (LLM) agents that utilize external tools and retrieval systems. IPI attacks involve attacker-controlled context silently steering agent actions over multi-turn trajectories, which existing heuristic-based defenses often mishandle by prematurely terminating workflows. AgentSentry models IPI as a temporal causal takeover, localizing these takeover points through controlled counterfactual re-executions at tool-return boundaries. It then purifies the context by removing attack-induced deviations while preserving task-relevant information. Evaluated on the AgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs, AgentSentry successfully eliminates attacks and achieves an average Utility Under Attack (UA) of 74.55%, improving UA by 20.8 to 33.6 percentage points over leading baselines without degrading benign performance.
Key takeaway
For AI Scientists developing or deploying LLM agents with external tools, AgentSentry offers a robust defense against indirect prompt injection. You should consider integrating its temporal causal diagnostics and context purification methods to enhance security. This approach significantly improves agent utility under attack while maintaining benign performance, addressing a critical vulnerability in autonomous agent design.
Key insights
AgentSentry mitigates indirect prompt injection in LLM agents by modeling it as a temporal causal takeover and purifying context.
Principles
- IPI is a temporal causal takeover.
- Counterfactual re-execution localizes attack points.
Method
AgentSentry uses controlled counterfactual re-executions at tool-return boundaries to localize IPI takeover points, followed by causally guided context purification to remove malicious deviations.
In practice
- Apply temporal causal diagnostics for multi-turn IPI.
- Implement context purification to preserve utility.
Topics
- LLM Agents
- Indirect Prompt Injection
- Inference-Time Defense
- Causal Diagnostics
- Context Purification
Best for: AI Scientist, Research Scientist, CTO, AI Researcher, AI Engineer, AI Security Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.