Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

The CUA-HandCrafted benchmark evaluates prompt-injection safety in frontier computer-using agents (CUAs) in browser environments. It comprises 793 episodes across 24 multi-step web tasks, with 56 hand-crafted attack templates grouped into 8 attack families. Against Claude Sonnet 4.6 and GPT-5.4, 0 of 140 multi-step attacks succeeded (Clopper–Pearson 95% upper bound on the attack success rate, ASR: 2.60%). The study attributes this resistance to the model weights rather than to system prompts. The safety is domain-conditioned, however: on a sister coding-agent benchmark, SkillBench, hand-crafted skill injection reached up to 100% attack success against the same models. The study argues that the high ASRs reported in prior literature (42–98%) stem largely from RL-optimized injection text rather than from the attack categories themselves, and that because this text is often unreleased, those results are hard to reproduce.
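The 2.60% figure can be checked by hand. The sketch below is not the authors' code; it assumes the bound is the upper end of a two-sided 95% Clopper–Pearson interval, and the function names are illustrative. It solves for the upper limit by bisection on the binomial CDF, using only the Python standard library.

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float:
    """Upper limit of the two-sided (1 - alpha) Clopper-Pearson interval.

    The limit is the p at which P(X <= k) equals alpha / 2.
    Assumption: the summary's bound is two-sided; a one-sided
    bound would use alpha instead of alpha / 2.
    """
    if k == n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):  # the CDF is decreasing in p, so bisection converges
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha / 2:
            lo = mid  # p too small: observing <= k successes is still too likely
        else:
            hi = mid
    return (lo + hi) / 2

upper = clopper_pearson_upper(0, 140)
print(f"{upper:.2%}")  # 2.60%
```

With zero observed successes the limit also has a closed form, 1 - (alpha/2)^(1/n), which gives the same value for n = 140.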

Key takeaway

For AI security engineers deploying frontier computer-using agents: browser-domain safety hardening does not generalize to other domains such as coding agents. Hand-crafted red-teaming may yield a 0% attack success rate, but the study suggests this reflects the attack phrasing more than inherent model robustness. Prioritize developing or acquiring RL-optimized attack generation techniques, and run safety evaluations across every deployment surface, including coding environments, to assess actual vulnerability.

Key insights

Frontier CUA safety is domain-conditioned: browser-domain resistance does not generalize to coding agents, and measured attack success depends heavily on whether the injection text is RL-optimized.

Principles

CUA safety hardening is domain-conditioned.
RL-optimized phrasing drives high attack success rates.
Published ASRs are unreproducible without optimized strings.

Method

CUA-HandCrafted uses 793 episodes, 24 multi-step web tasks, 56 hand-crafted attack templates, and canary detection to measure ASR against frontier CUAs.
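The summary does not describe how canary detection is implemented. The sketch below shows one common way to do it, under assumptions: each injected payload carries a unique canary string, and an attack counts as successful if that string appears in any action the agent emits. The `Episode` fields, the function names, and the family labels in the test data are hypothetical and not taken from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Episode:
    family: str               # attack family label (hypothetical naming)
    canary: str               # unique token embedded in the injected payload
    agent_actions: list[str]  # serialized actions the agent emitted

def attack_succeeded(ep: Episode) -> bool:
    """Assumed success criterion: the canary surfaces in any agent action."""
    return any(ep.canary in action for action in ep.agent_actions)

def asr_by_family(episodes: list[Episode]) -> dict[str, float]:
    """Attack success rate per family: successes / episodes in that family."""
    counts = defaultdict(lambda: [0, 0])  # family -> [successes, total]
    for ep in episodes:
        counts[ep.family][1] += 1
        if attack_succeeded(ep):
            counts[ep.family][0] += 1
    return {family: s / t for family, (s, t) in counts.items()}
```

Substring matching is the simplest criterion. A real harness would likely also need to handle canaries that are encoded, split across actions, or exfiltrated through a side channel.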

In practice

Evaluate CUA safety across all deployment surfaces.
Focus red-teaming on RL-discovered attack phrasing.
Require optimized attack strings for ASR reproducibility.

Topics

Computer-Using Agents
Prompt Injection
Red Teaming
CUA-HandCrafted Benchmark
Domain-Conditioned Safety
RL-Optimized Attacks
Reproducibility

Code references

RPC2/AutoInject

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.