Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
Summary
The MT-AgentRisk benchmark and ToolShield defense address the growing safety gap in LLM-based agents, particularly in multi-turn, tool-using interactions. MT-AgentRisk, the first benchmark of its kind, uses a principled taxonomy to transform 365 single-turn harmful tasks into multi-turn attack sequences across five tools: Filesystem-MCP, Browser, PostgreSQL-MCP, Notion-MCP, and Terminal. Evaluations on models like Claude-4.5-Sonnet, Qwen3-Coder, and Seed-1.6 revealed substantial safety degradation, with the Attack Success Rate (ASR) increasing by 16% on average. Claude-4.5-Sonnet's ASR rose by 27%, Qwen3-Coder's by 23%, and Seed-1.6's by 12%. To counter this, ToolShield, a training-free, self-exploration defense, autonomously generates test cases, executes them in a sandbox, and distills safety experiences. ToolShield effectively reduced ASR by 30% on average in multi-turn settings, with Claude-4.5-Sonnet showing a 50% reduction.
Key takeaway
For AI Security Engineers deploying LLM agents, recognize that multi-turn interactions significantly increase attack success rates, even for highly capable models. You should implement proactive, self-exploration defenses like ToolShield to identify and mitigate functional risks before deployment. This approach allows agents to learn from simulated mistakes, reducing Attack Success Rate by up to 50% without costly retraining or degrading benign task performance. Prioritize defenses that generate transferable safety experiences across diverse tools and models.
Key insights
Multi-turn, tool-using agent interactions significantly degrade safety, necessitating proactive, self-exploring defenses.
Principles
- Multi-turn interactions amplify agent vulnerability.
- Stronger capabilities do not imply better safety.
- Environment-targeting attacks are highly effective.
Method
ToolShield's self-exploration defense synthesizes test cases from tool documentation, executes them in a sandbox, and distills safety experiences for deployment, without retraining.
In practice
- Generate multi-turn attack sequences using the Addition or Decomposition taxonomy.
- Proactively explore new tools in a sandbox to identify harm patterns.
- Inject learned safety experiences into agent context for deployment.
Topics
- LLM Agents
- Multi-Turn Safety
- Agent Safety Benchmarking
- Tool-Using Agents
- ToolShield Defense
- Attack Success Rate
Code references
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Engineer, AI Security Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.