SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents
Summary
SafeClawBench is a staged benchmark for evaluating security failures in tool-using language model agents. It looks past unsafe text to unsafe actions, such as disclosing protected objects, modifying databases, or triggering harmful code. It comprises 600 controlled adversarial tasks across six attack families: direct prompt injection, indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference. The benchmark reports three distinct endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool/state harm. In the reported evaluations, semantic failure rates range from 9.0% to 44.2% across models, and 291 of the 347 observed sandbox harms occurred in cases that passed the semantic check. SafeClawBench provides a reproducible framework for comparing agent models and prompt-policy conditions, and its open-source dataset is available at https://huggingface.co/datasets/sairights/safeclawbench.
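The gap between the semantic and sandbox endpoints can be made concrete with a little arithmetic on the two counts quoted in the summary. The snippet below is only a worked calculation; the variable names are illustrative and nothing here comes from the benchmark's own code.

```python
# Counts quoted in the summary (291 of 347 observed sandbox harms
# passed the semantic check). Variable names are illustrative.
sandbox_harms_total = 347
harms_passing_semantic = 291

# Harms that the semantic check also flagged.
harms_flagged_semantic = sandbox_harms_total - harms_passing_semantic

# Share of sandbox harms that a semantic-only evaluation would miss.
share_missed = harms_passing_semantic / sandbox_harms_total

print(f"{harms_flagged_semantic} sandbox harms also failed the semantic check")
print(f"{share_missed:.1%} of sandbox harms passed the semantic check")
```

In other words, roughly five out of six observed sandbox harms would not have been caught by the semantic endpoint alone.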
Key takeaway
For AI security engineers deploying tool-using LLM agents, semantic compliance metrics alone are insufficient. Distinguish between an agent's textual agreement with an attack and the observable, executable harm it actually causes through tool calls and state changes. Integrate staged security benchmarks such as SafeClawBench into your evaluation pipeline, because substantial sandbox harm can occur even when semantic checks pass.
Key insights
Tool-using LLM agent security requires distinguishing semantic compliance from observable, executable harm.
Principles
- Harm endpoints capture distinct failure modes
- Prompt policies' effects depend on model and protocol
Method
A staged benchmark evaluates tool-using LLM agents across semantic acceptance, audit-visible evidence, and sandbox-observed tool/state harm.
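One way to picture the three endpoints is as three independent flags recorded per task, each aggregated into its own rate. The sketch below is a minimal illustration under that assumption; `TaskResult`, its field names, and the toy rows are hypothetical and do not reflect the benchmark's actual schema or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record with one flag per endpoint."""
    task_id: str
    semantic_accept: bool   # the agent's text accepted the attack
    audit_evidence: bool    # harm evidence is visible in the audit trail
    sandbox_harm: bool      # a harmful tool call or state change was observed


def endpoint_rates(results):
    """Return the failure rate for each endpoint over a list of results."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to aggregate")
    return {
        "semantic": sum(r.semantic_accept for r in results) / n,
        "audit": sum(r.audit_evidence for r in results) / n,
        "sandbox": sum(r.sandbox_harm for r in results) / n,
    }


def harms_missed_by_semantic(results):
    """Task ids with sandbox harm even though the semantic check passed."""
    return [r.task_id for r in results
            if r.sandbox_harm and not r.semantic_accept]


# Made-up toy rows for illustration only, not benchmark data.
toy = [
    TaskResult("t1", True, True, True),
    TaskResult("t2", False, False, True),
    TaskResult("t3", False, True, False),
    TaskResult("t4", False, False, False),
]

rates = endpoint_rates(toy)
missed = harms_missed_by_semantic(toy)
print(rates)
print(missed)
```

Keeping the flags separate, rather than collapsing them into one pass/fail label, is what makes cases like `t2` visible.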
In practice
- Use SafeClawBench dataset for agent security testing
- Compare agent models under various prompt policies
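A comparison across models and prompt policies can be sketched as a grid of harm rates keyed by (model, policy), followed by a per-model difference between two policies. The model names, policy names, and rows below are made up for illustration, and the helpers `harm_rate_grid` and `policy_effect` are hypothetical, not part of SafeClawBench.

```python
from collections import defaultdict

# (model, policy, sandbox_harm) rows: made-up toy values, not benchmark data.
runs = [
    ("model_a", "baseline", True),
    ("model_a", "baseline", False),
    ("model_a", "hardened", False),
    ("model_a", "hardened", False),
    ("model_b", "baseline", True),
    ("model_b", "baseline", False),
    ("model_b", "hardened", True),
    ("model_b", "hardened", False),
]


def harm_rate_grid(rows):
    """Map each (model, policy) pair to its observed sandbox harm rate."""
    counts = defaultdict(lambda: [0, 0])  # [harm count, task count]
    for model, policy, harmed in rows:
        counts[(model, policy)][0] += int(harmed)
        counts[(model, policy)][1] += 1
    return {key: harms / total for key, (harms, total) in counts.items()}


def policy_effect(grid, baseline, treatment):
    """Per-model change in harm rate when moving from baseline to treatment.

    Assumes every model in the grid was run under both policies.
    """
    models = sorted({model for model, _ in grid})
    return {m: grid[(m, treatment)] - grid[(m, baseline)] for m in models}


grid = harm_rate_grid(runs)
effect = policy_effect(grid, "baseline", "hardened")
print(effect)
```

In this toy data the same policy change lowers the harm rate for one model and leaves the other unchanged, which is the kind of model-dependent effect the principles above describe.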
Topics
- LLM Agents
- Tool-Using LLMs
- Security Benchmarking
- Prompt Injection
- Memory Poisoning
- Data Security
Best for: Research Scientist, AI Architect, Machine Learning Engineer, AI Scientist, AI Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.