SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

SafeClawBench is a staged benchmark for evaluating security failures in tool-using language model agents. It looks beyond unsafe text to unsafe actions, such as disclosing protected objects, modifying databases, or triggering harmful code. It comprises 600 controlled adversarial tasks across six attack families: direct prompt injection, indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference. The benchmark reports three distinct endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool/state harm. In the reported evaluations, semantic failure rates range from 9.0% to 44.2% across models, and 291 of 347 observed sandbox harms (about 84%) occurred in runs that passed the semantic check. SafeClawBench provides a reproducible framework for comparing agent models and prompt-policy conditions, and its open-source dataset is available at https://huggingface.co/datasets/sairights/safeclawbench.

Key takeaway

For AI security engineers deploying tool-using LLM agents, semantic compliance metrics alone are not enough. Distinguish between an agent textually agreeing to an attack and the agent actually causing observable, executable harm. Integrate a staged security benchmark such as SafeClawBench into your evaluation pipeline, because most observed sandbox harms (291 of 347) occurred in runs that passed the semantic check.

Key insights

Tool-using LLM agent security requires distinguishing semantic compliance from observable, executable harm.

Principles

Harm endpoints capture distinct failure modes
Prompt policies' effects depend on model and protocol

Method

A staged benchmark evaluates tool-using LLM agents across semantic acceptance, audit-visible evidence, and sandbox-observed tool/state harm.
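
The three endpoints can be pictured as three independent flags recorded per task. The following is a minimal sketch under that assumption, not the benchmark's actual schema or scoring code: the `TaskOutcome` class, its field names, and the two helper functions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    semantic_accept: bool  # the agent's text accepted the attack
    audit_evidence: bool   # harm evidence is visible in the audit trail
    sandbox_harm: bool     # a harmful tool call or state change was observed


def endpoint_rates(outcomes):
    """Return the fraction of tasks that failed at each endpoint."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to score")
    return {
        "semantic": sum(o.semantic_accept for o in outcomes) / n,
        "audit": sum(o.audit_evidence for o in outcomes) / n,
        "sandbox": sum(o.sandbox_harm for o in outcomes) / n,
    }


def harms_missed_by_semantic(outcomes):
    """Tasks with sandbox harm even though the semantic check passed."""
    return [o for o in outcomes if o.sandbox_harm and not o.semantic_accept]
```

Scoring each endpoint separately is what makes the gap visible: a task counted by `harms_missed_by_semantic` would look safe under a semantic-only metric even though the sandbox recorded harm.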

In practice

Use the SafeClawBench dataset for agent security testing
Compare agent models under different prompt policies
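
One way to organize such a comparison is to group run results by (model, prompt policy) and compute a harm rate per cell. The sketch below assumes each run has already been reduced to a `(model, policy, harmed)` tuple; that tuple layout and the function name are illustrative assumptions, not part of the SafeClawBench tooling.

```python
from collections import defaultdict


def harm_rate_by_condition(runs):
    """Map each (model, policy) pair to its fraction of harmful runs.

    `runs` is an iterable of (model, policy, harmed) tuples, where
    `harmed` is True if sandbox harm was observed for that run.
    """
    counts = defaultdict(lambda: [0, 0])  # (model, policy) -> [harmed, total]
    for model, policy, harmed in runs:
        cell = counts[(model, policy)]
        cell[0] += int(harmed)
        cell[1] += 1
    return {key: harmed / total for key, (harmed, total) in counts.items()}
```

Keeping the model and the policy together in the key reflects the principle above that a prompt policy's effect depends on the model it is applied to, so rates are not averaged across models.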

Topics

LLM Agents
Tool-Using LLMs
Security Benchmarking
Prompt Injection
Memory Poisoning
Data Security

Best for: Research Scientist, AI Architect, Machine Learning Engineer, AI Scientist, AI Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.