OSGuard: A Benchmark for Safety in Computer-Use Agents
Summary
OSGuard is a new dual-granularity benchmark suite designed to evaluate safety in computer-use agents, addressing the limitation of current evaluations that prioritize task success over potential unsafe shortcuts. Introduced to assess agents under benign user instructions, OSGuard comprises two main components. First, an action-level benchmark assesses local guardrail decisions by classifying proposed actions as allowed, unrelated, or unsafe, relative to the original instruction and current interface state. Second, a risk-augmented execution suite features manually constructed OSWorld-derived task variants that introduce latent hazards, such as destructive overwrites, while maintaining the original task's achievability. This suite uses augmented evaluators to differentiate safe completions from unsafe ones that still meet the nominal task objective. Experimental results on OSGuard indicate that existing multimodal guardrails perform adequately on isolated action judgments but reveal significant gaps in reliable end-to-end safety when deployed in full-task scenarios. This dual-granularity approach facilitates precise diagnosis of model capabilities in recognizing unsafe actions and enhancing overall task safety.
Key takeaway
For AI Engineers developing computer-use agents, relying solely on task completion metrics is insufficient for safety. You must integrate dual-granularity benchmarks like OSGuard. This helps diagnose if your models recognize unsafe actions and maintain full-task safety. Prioritize testing with risk-augmented environments to expose latent hazards. This ensures robust end-to-end agent safety, moving beyond isolated guardrail performance.
Key insights
OSGuard's dual-granularity benchmark reveals current guardrails excel locally but fail to ensure end-to-end safety in computer-use agents.
Principles
- Task success alone is insufficient for agent safety evaluation.
- Local action safety does not guarantee end-to-end safety.
- Dual-granularity testing improves safety diagnosis.
Method
OSGuard employs an action-level benchmark for local guardrail decisions and a risk-augmented execution suite with OSWorld-derived tasks and augmented evaluators to identify unsafe completions.
In practice
- Integrate action-level safety checks.
- Design tasks with latent hazards.
- Augment evaluators for safety invariants.
Topics
- Computer-Use Agents
- Agent Safety Benchmarking
- Multimodal Guardrails
- OSGuard Benchmark
- End-to-End Safety
- Risk-Augmented Execution
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Engineer, AI Security Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.