Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

MT-AgentRisk and ToolShield target a growing safety gap in LLM-based agents: risks that emerge in multi-turn, tool-using interactions. MT-AgentRisk, presented as the first benchmark of its kind, applies a principled taxonomy to transform 365 single-turn harmful tasks into multi-turn attack sequences across five tools: Filesystem-MCP, Browser, PostgreSQL-MCP, Notion-MCP, and Terminal. Evaluations on models including Claude-4.5-Sonnet, Qwen3-Coder, and Seed-1.6 showed substantial safety degradation in the multi-turn setting, with Attack Success Rate (ASR) increasing by 16% on average: 27% for Claude-4.5-Sonnet, 23% for Qwen3-Coder, and 12% for Seed-1.6. To counter this, ToolShield, a training-free self-exploration defense, autonomously generates test cases, executes them in a sandbox, and distills the results into safety experiences. ToolShield reduced ASR by 30% on average in multi-turn settings, with Claude-4.5-Sonnet showing a 50% reduction.
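
To make the ASR figures above concrete, here is a minimal sketch of how an attack success rate and its single-turn to multi-turn change could be computed. The `Trial` record, the "single"/"multi" labels, and the toy data are all assumptions made for illustration; they are not the benchmark's schema or results. The summary does not say whether the reported increases are absolute percentage points or relative changes, so the sketch computes the absolute difference as one possible reading.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    task_id: str
    setting: str            # "single" or "multi" (hypothetical labels)
    attack_succeeded: bool  # True if the agent completed the harmful task


def attack_success_rate(trials, setting):
    """Fraction of trials in one setting where the attack succeeded."""
    subset = [t for t in trials if t.setting == setting]
    if not subset:
        raise ValueError(f"no trials for setting {setting!r}")
    return sum(t.attack_succeeded for t in subset) / len(subset)


# Made-up toy data, NOT results from the paper.
trials = [
    Trial("t1", "single", False), Trial("t1", "multi", True),
    Trial("t2", "single", True),  Trial("t2", "multi", True),
    Trial("t3", "single", False), Trial("t3", "multi", False),
    Trial("t4", "single", False), Trial("t4", "multi", True),
]

single = attack_success_rate(trials, "single")
multi = attack_success_rate(trials, "multi")

# Absolute change in percentage points (one possible reading of "increase").
delta_points = (multi - single) * 100
print(f"single-turn ASR={single:.2f}, multi-turn ASR={multi:.2f}, "
      f"change={delta_points:+.1f} points")
```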

Key takeaway

AI security engineers deploying LLM agents should recognize that multi-turn interactions significantly increase attack success rates, even for highly capable models. Implement proactive, self-exploration defenses such as ToolShield to identify and mitigate functional risks before deployment. This approach lets agents learn from simulated mistakes, reducing Attack Success Rate by up to 50% without costly retraining or degraded benign-task performance. Prioritize defenses whose safety experiences transfer across diverse tools and models.

Key insights

Multi-turn, tool-using agent interactions significantly degrade safety, necessitating proactive, self-exploring defenses.

Method

ToolShield's self-exploration defense synthesizes test cases from tool documentation, executes them in a sandbox, and distills safety experiences for deployment, without retraining.
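
The three stages described above (synthesize test cases, execute in a sandbox, distill experiences) can be sketched as a simple loop. This is an illustrative outline under stated assumptions, not ToolShield's actual implementation: `ProbeCase`, `Outcome`, `self_explore`, `build_system_prompt`, and the three callables are hypothetical names, the stubs stand in for an LLM and a real sandbox, and injecting the distilled notes into the agent's prompt is an assumed deployment mechanism that the summary does not confirm.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeCase:
    tool: str
    instruction: str


@dataclass
class Outcome:
    case: ProbeCase
    harmful: bool  # whether the sandboxed run produced a harmful effect
    trace: str     # log of what the agent did


def self_explore(
    tool_docs: dict[str, str],
    propose: Callable[[str, str], list[str]],
    execute: Callable[[ProbeCase], Outcome],
    summarize: Callable[[Outcome], str],
) -> list[str]:
    """Generate probes from tool docs, run them, keep lessons from harmful runs."""
    experiences = []
    for tool, doc in tool_docs.items():
        for instruction in propose(tool, doc):
            outcome = execute(ProbeCase(tool, instruction))
            if outcome.harmful:
                experiences.append(summarize(outcome))
    return experiences


def build_system_prompt(base: str, experiences: list[str]) -> str:
    """Assumed deployment step: prepend distilled notes, no retraining."""
    if not experiences:
        return base
    notes = "\n".join(f"- {e}" for e in experiences)
    return f"{base}\n\nSafety experiences from sandbox exploration:\n{notes}"


# Hypothetical stubs standing in for an LLM and a sandbox.
def fake_propose(tool: str, doc: str) -> list[str]:
    return [f"read a file with {tool}", f"delete all files with {tool}"]


def fake_execute(case: ProbeCase) -> Outcome:
    harmful = "delete" in case.instruction
    return Outcome(case, harmful, trace=f"ran: {case.instruction}")


def fake_summarize(outcome: Outcome) -> str:
    return f"{outcome.case.tool}: confirm intent before '{outcome.case.instruction}'"


experiences = self_explore(
    {"Filesystem-MCP": "Reads, writes, and deletes files."},  # placeholder doc text
    fake_propose, fake_execute, fake_summarize,
)
prompt = build_system_prompt("You are a tool-using agent.", experiences)
print(prompt)
```

Passing the generator, sandbox, and summarizer in as callables keeps the loop independent of any particular model or tool, which mirrors the training-free framing: only the agent's context changes, not its weights.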

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.