AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

2026-02-26 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, medium

Summary

AgentSentry is a new inference-time defense framework designed to mitigate indirect prompt injection (IPI) in large language model (LLM) agents that utilize external tools and retrieval systems. IPI attacks involve attacker-controlled context silently steering agent actions over multi-turn trajectories, which existing heuristic-based defenses often mishandle by prematurely terminating workflows. AgentSentry models IPI as a temporal causal takeover, localizing these takeover points through controlled counterfactual re-executions at tool-return boundaries. It then purifies the context by removing attack-induced deviations while preserving task-relevant information. Evaluated on the AgentDojo benchmark across four task suites, three IPI attack families, and multiple black-box LLMs, AgentSentry successfully eliminates attacks and achieves an average Utility Under Attack (UA) of 74.55%, improving UA by 20.8 to 33.6 percentage points over leading baselines without degrading benign performance.

Key takeaway

For AI Scientists developing or deploying LLM agents with external tools, AgentSentry offers a robust defense against indirect prompt injection. You should consider integrating its temporal causal diagnostics and context purification methods to enhance security. This approach significantly improves agent utility under attack while maintaining benign performance, addressing a critical vulnerability in autonomous agent design.

Key insights

AgentSentry mitigates indirect prompt injection in LLM agents by modeling it as a temporal causal takeover and purifying context.

Principles

IPI is a temporal causal takeover.
Counterfactual re-execution localizes attack points.

Method

AgentSentry uses controlled counterfactual re-executions at tool-return boundaries to localize IPI takeover points, followed by causally guided context purification to remove malicious deviations.

In practice

Apply temporal causal diagnostics for multi-turn IPI.
Implement context purification to preserve utility.

Topics

LLM Agents
Indirect Prompt Injection
Inference-Time Defense
Causal Diagnostics
Context Purification

Best for: AI Scientist, Research Scientist, CTO, AI Researcher, AI Engineer, AI Security Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.