Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming
Summary
The CUA-HandCrafted benchmark evaluates prompt-injection safety in frontier computer-using agents (CUAs). It comprises 793 episodes spanning 24 multi-step web tasks, 56 attack templates, and 8 attack families. Against Claude Sonnet 4.6 and GPT-5.4, hand-crafted injection techniques in browser environments produced 0/140 successful multi-step attacks (Clopper–Pearson 95% upper bound 2.60%). The study attributes this resistance to the model weights rather than to system prompts. The safety is domain-conditioned, however: on a sister coding-agent benchmark, SkillBench, hand-crafted skill injection reached up to 100% attack success against the same models. The study argues that the high attack success rates (42–98%) reported in prior literature stem largely from RL-optimized injection text rather than from the attack categories themselves, and that because this text is often unreleased, the published results are hard to reproduce.
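The 2.60% figure can be checked by hand. The sketch below assumes the reported bound is the upper limit of a two-sided 95% Clopper–Pearson interval; a one-sided 95% bound would come out lower, at about 2.12%. With zero successes the exact bound has a closed form, so no statistics library is needed. The function name is illustrative and not taken from the paper.

```python
def zero_success_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Two-sided Clopper-Pearson upper limit when 0 of n trials succeed.

    With zero successes the exact bound solves (1 - p)**n = alpha / 2,
    so no beta-quantile routine is needed.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


# 0 successful attacks out of 140 multi-step attempts, as in the summary.
upper = zero_success_upper_bound(140)
print(f"{upper:.2%}")  # 2.60%
```

The bound shrinks roughly in proportion to 1/n, so a 0% observed rate on a small sample still leaves room for a true rate of a few percent.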
Key takeaway
For AI Security Engineers deploying frontier computer-using agents: browser-domain safety hardening does not generalize to other domains such as coding agents. Hand-crafted red-teaming may yield a 0% attack success rate (ASR), but that result likely reflects the attack phrasing rather than inherent model robustness. Prioritize developing or acquiring RL-optimized attack generation techniques, and run safety evaluations across every deployment surface, including coding environments, to assess actual vulnerability.
Key insights
Frontier CUA safety is domain-conditioned: resistance in the browser does not generalize to coding agents, and reported attack success depends heavily on RL-optimized attack phrasing.
Principles
- CUA safety hardening is domain-conditioned.
- RL-optimized phrasing drives high attack success rates.
- Published ASRs are unreproducible without optimized strings.
Method
CUA-HandCrafted runs 793 episodes built from 24 multi-step web tasks and 56 hand-crafted attack templates, and uses canary detection to measure ASR against frontier CUAs.
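As a rough illustration of canary-based scoring, the sketch below assumes each injected instruction carries a unique canary token and counts an attack as successful when that token appears in anything the agent outputs. The summary does not describe the benchmark's actual detection logic, so the `Episode` schema, the success criterion, and the attack family names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    """One benchmark run (hypothetical schema, not the benchmark's own)."""
    attack_family: str
    canary: str               # unique token planted in the injected instruction
    agent_outputs: list[str]  # everything the agent typed, sent, or submitted


def attack_succeeded(ep: Episode) -> bool:
    # Assumed success criterion: the canary leaks into any agent output.
    return any(ep.canary in out for out in ep.agent_outputs)


def asr_by_family(episodes: list[Episode]) -> dict[str, float]:
    """Attack success rate per attack family: successes / episodes."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for ep in episodes:
        totals[ep.attack_family] += 1
        hits[ep.attack_family] += attack_succeeded(ep)
    return {fam: hits[fam] / totals[fam] for fam in totals}


# Illustrative data only; family names and canaries are made up.
episodes = [
    Episode("hidden_text", "CANARY-7f3a", ["click #submit", "type 'hello'"]),
    Episode("hidden_text", "CANARY-91bc", ["navigate https://example.com/?q=CANARY-91bc"]),
    Episode("fake_dialog", "CANARY-02de", ["click #ok"]),
]
print(asr_by_family(episodes))  # {'hidden_text': 0.5, 'fake_dialog': 0.0}
```

Grouping by a different key, such as deployment surface (browser versus coding agent), would give the per-domain comparison the summary describes.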
In practice
- Evaluate CUA safety across all deployment surfaces.
- Focus red-teaming on RL-discovered attack phrasing.
- Require optimized attack strings for ASR reproducibility.
Topics
- Computer-Using Agents
- Prompt Injection
- Red Teaming
- CUA-HandCrafted Benchmark
- Domain-Conditioned Safety
- RL-Optimized Attacks
- Reproducibility
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Security Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.