Human-Guided Harm Recovery for Computer Use Agents
Summary
Researchers have formalized "harm recovery" for large language model (LLM) agents that operate on computer systems: the problem of steering an agent from a harmful state back to a safe one. The framework aligns recovery with human preferences, which were identified through a formative user study that yielded a natural language rubric. A dataset of 1,150 pairwise judgments revealed that the importance of individual attributes shifts with context, with a general preference for pragmatic, targeted strategies over comprehensive long-term approaches. These insights were operationalized into a reward model that re-ranks candidate recovery plans. To evaluate recovery systematically, the team introduced BackBench, a benchmark of 50 computer-use tasks. In human evaluations, the reward model scaffold produced higher-quality recovery trajectories than base agents and rubric-based scaffolds.
Key takeaway
If you are a research scientist developing autonomous agents, plan for harm recovery as well as harm prevention. Your agent designs should include mechanisms for post-execution remediation, using preference-aligned reward models to steer the agent out of harmful states. Consider using the BackBench benchmark to test and validate your agent's recovery capabilities systematically.
Key insights
Harm recovery formalizes steering LLM agents from harmful states to safe ones, aligned with human preferences.
Principles
- Recovery preferences are context-dependent.
- Pragmatic, targeted recovery is often preferred over comprehensive long-term strategies.
Method
A reward model, trained on 1,150 pairwise human judgments and informed by a natural language rubric, re-ranks agent-generated recovery plans so that the selected plan aligns with human preferences.
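The re-ranking step can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration: the `RecoveryPlan` structure, the `rerank_plans` function, and especially `toy_score`, which is a hand-written stand-in and not the trained reward model described in the original article. The sketch only shows the general pattern of scoring each candidate plan in context and sorting by score.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RecoveryPlan:
    """A candidate recovery plan proposed by the agent (hypothetical structure)."""
    description: str
    steps: tuple[str, ...]


def rerank_plans(
    context: str,
    plans: Sequence[RecoveryPlan],
    score_fn: Callable[[str, RecoveryPlan], float],
) -> list[RecoveryPlan]:
    """Return the plans ordered from highest to lowest score.

    Ties keep their original order, so the result is deterministic.
    """
    scored = [(score_fn(context, plan), index, plan) for index, plan in enumerate(plans)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [plan for _, _, plan in scored]


def toy_score(context: str, plan: RecoveryPlan) -> float:
    """Stand-in scorer (an assumption, NOT a trained reward model).

    It rewards steps that mention the affected resource and applies a small
    penalty per step, loosely mimicking a preference for targeted plans.
    """
    targeted = sum(1 for step in plan.steps if context.lower() in step.lower())
    return targeted - 0.1 * len(plan.steps)


candidates = [
    RecoveryPlan(
        "Full system audit",
        (
            "snapshot disk",
            "audit all permissions",
            "reinstall backup tool",
            "restore report.txt from backup",
        ),
    ),
    RecoveryPlan("Do nothing", ()),
    RecoveryPlan("Restore the deleted file", ("restore report.txt from trash",)),
]

ranked = rerank_plans("report.txt", candidates, toy_score)
for plan in ranked:
    print(plan.description)
```

In a real system, `score_fn` would be replaced by a call to a learned, preference-aligned reward model; the sorting logic around it would stay the same.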
In practice
- Develop preference-aligned recovery strategies.
- Use BackBench for systematic recovery evaluation.
Topics
- Harm Recovery
- LLM Agents
- Agent Safety
- Human Preferences
- Reward Models
Best for: Research Scientist, AI Scientist, AI Security Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.