Human-Guided Harm Recovery for Computer Use Agents

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cybersecurity & Data Privacy) · Depth: Expert, quick

Summary

The authors formalize "harm recovery" for large language model (LLM) agents that operate computer systems: the problem of steering an agent from a harmful state back to a safe one. The framework aligns recovery with human preferences, which a formative user study distilled into a natural language rubric. A dataset of 1,150 pairwise judgments showed that the importance of each attribute shifts with context, with a preference for pragmatic, targeted strategies over comprehensive long-term approaches. These findings were operationalized as a reward model that re-ranks candidate recovery plans. To evaluate recovery systematically, the team introduced BackBench, a benchmark of 50 computer-use tasks. In human evaluations, the reward model scaffold produced higher-quality recovery trajectories than both base agents and rubric-based scaffolds.

Key takeaway

For research scientists developing autonomous agents, the practical point is that prevention alone is not enough. Your agent designs should include mechanisms for post-execution remediation, using preference-aligned reward models to move the agent out of harmful states. Consider using the BackBench benchmark to test and validate your agent's recovery capabilities.

Key insights

Harm recovery formalizes the problem of steering an LLM agent from a harmful state back to a safe one in a way that matches human preferences.

Method

A reward model, trained on 1,150 pairwise human judgments and a natural language rubric, re-ranks agent-generated recovery plans so that the selected plan matches human preferences.
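
The re-ranking step can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the original work: the `RecoveryPlan` structure, the `rerank_plans` helper, and especially `toy_score`, which is a hand-written stand-in for a learned reward model. The sketch only shows the general pattern of scoring each candidate plan and sorting by score.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class RecoveryPlan:
    """A candidate recovery plan proposed by the agent (hypothetical structure)."""
    description: str
    steps: Sequence[str]


def rerank_plans(
    plans: Sequence[RecoveryPlan],
    score_fn: Callable[[str, RecoveryPlan], float],
    harm_context: str,
) -> List[RecoveryPlan]:
    """Return the plans sorted from highest to lowest score."""
    return sorted(plans, key=lambda p: score_fn(harm_context, p), reverse=True)


def toy_score(harm_context: str, plan: RecoveryPlan) -> float:
    """Stand-in for a learned reward model (assumption, not the authors' model).

    It simply prefers plans with fewer steps, as a crude proxy for
    "pragmatic, targeted" strategies. It ignores harm_context entirely.
    """
    return -float(len(plan.steps))


candidates = [
    RecoveryPlan(
        "Full system audit",
        ["snapshot disk", "reinstall OS", "rotate all keys", "audit logs"],
    ),
    RecoveryPlan("Restore deleted file", ["restore file from trash"]),
    RecoveryPlan("Revoke leaked token", ["revoke token", "issue new token"]),
]

ranked = rerank_plans(candidates, toy_score, harm_context="agent deleted a user file")
best = ranked[0]
print(best.description)
```

In a real system, `score_fn` would wrap a call to the trained reward model, and the harm context would presumably inform the score, since the judgments described in the summary were context-dependent.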

Best for: Research Scientist, AI Scientist, AI Security Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.