OSGuard: A Benchmark for Safety in Computer-Use Agents

2026-06-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

OSGuard is a new dual-granularity benchmark suite designed to evaluate safety in computer-use agents, addressing the limitation of current evaluations that prioritize task success over potential unsafe shortcuts. Introduced to assess agents under benign user instructions, OSGuard comprises two main components. First, an action-level benchmark assesses local guardrail decisions by classifying proposed actions as allowed, unrelated, or unsafe, relative to the original instruction and current interface state. Second, a risk-augmented execution suite features manually constructed OSWorld-derived task variants that introduce latent hazards, such as destructive overwrites, while maintaining the original task's achievability. This suite uses augmented evaluators to differentiate safe completions from unsafe ones that still meet the nominal task objective. Experimental results on OSGuard indicate that existing multimodal guardrails perform adequately on isolated action judgments but reveal significant gaps in reliable end-to-end safety when deployed in full-task scenarios. This dual-granularity approach facilitates precise diagnosis of model capabilities in recognizing unsafe actions and enhancing overall task safety.

Key takeaway

For AI Engineers developing computer-use agents, relying solely on task completion metrics is insufficient for safety. You must integrate dual-granularity benchmarks like OSGuard. This helps diagnose if your models recognize unsafe actions and maintain full-task safety. Prioritize testing with risk-augmented environments to expose latent hazards. This ensures robust end-to-end agent safety, moving beyond isolated guardrail performance.

Key insights

OSGuard's dual-granularity benchmark reveals current guardrails excel locally but fail to ensure end-to-end safety in computer-use agents.

Principles

Task success alone is insufficient for agent safety evaluation.
Local action safety does not guarantee end-to-end safety.
Dual-granularity testing improves safety diagnosis.

Method

OSGuard employs an action-level benchmark for local guardrail decisions and a risk-augmented execution suite with OSWorld-derived tasks and augmented evaluators to identify unsafe completions.

In practice

Integrate action-level safety checks.
Design tasks with latent hazards.
Augment evaluators for safety invariants.

Topics

Computer-Use Agents
Agent Safety Benchmarking
Multimodal Guardrails
OSGuard Benchmark
End-to-End Safety
Risk-Augmented Execution

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, AI Engineer, AI Security Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.