Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

2026-01-17 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

The MT-AgentRisk benchmark and ToolShield defense address the growing safety gap in LLM-based agents, particularly in multi-turn, tool-using interactions. MT-AgentRisk, the first benchmark of its kind, uses a principled taxonomy to transform 365 single-turn harmful tasks into multi-turn attack sequences across five tools: Filesystem-MCP, Browser, PostgreSQL-MCP, Notion-MCP, and Terminal. Evaluations on models like Claude-4.5-Sonnet, Qwen3-Coder, and Seed-1.6 revealed substantial safety degradation, with the Attack Success Rate (ASR) increasing by 16% on average. Claude-4.5-Sonnet's ASR rose by 27%, Qwen3-Coder's by 23%, and Seed-1.6's by 12%. To counter this, ToolShield, a training-free, self-exploration defense, autonomously generates test cases, executes them in a sandbox, and distills safety experiences. ToolShield effectively reduced ASR by 30% on average in multi-turn settings, with Claude-4.5-Sonnet showing a 50% reduction.

Key takeaway

For AI Security Engineers deploying LLM agents, recognize that multi-turn interactions significantly increase attack success rates, even for highly capable models. You should implement proactive, self-exploration defenses like ToolShield to identify and mitigate functional risks before deployment. This approach allows agents to learn from simulated mistakes, reducing Attack Success Rate by up to 50% without costly retraining or degrading benign task performance. Prioritize defenses that generate transferable safety experiences across diverse tools and models.

Key insights

Multi-turn, tool-using agent interactions significantly degrade safety, necessitating proactive, self-exploring defenses.

Principles

Multi-turn interactions amplify agent vulnerability.
Stronger capabilities do not imply better safety.
Environment-targeting attacks are highly effective.

Method

ToolShield's self-exploration defense synthesizes test cases from tool documentation, executes them in a sandbox, and distills safety experiences for deployment, without retraining.

In practice

Generate multi-turn attack sequences using the Addition or Decomposition taxonomy.
Proactively explore new tools in a sandbox to identify harm patterns.
Inject learned safety experiences into agent context for deployment.

Topics

LLM Agents
Multi-Turn Safety
Agent Safety Benchmarking
Tool-Using Agents
ToolShield Defense
Attack Success Rate

Code references

CHATS-lab/ToolShield

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Engineer, AI Security Engineer

Related on AIssential

See Counsel's argued verdicts on the open AI decisions leaders are weighing →

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.