Human-Guided Harm Recovery for Computer Use Agents

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Researchers from MIT CSAIL, Abridge, and humans& introduce "harm recovery," an agent-safety paradigm that focuses on remediating harmful actions by computer use agents after they occur rather than solely preventing them. They formalize harm recovery as an optimization problem aligned with human preferences and run a formative user study to identify key recovery dimensions, which they distill into a natural language rubric. A dataset of 1,150 pairwise judgments shows that the importance of each attribute shifts with context and that users often prefer pragmatic, targeted strategies over comprehensive long-term ones. The authors operationalize these insights in a reward model that re-ranks candidate recovery plans generated by an agent scaffold. To evaluate recovery capabilities systematically, the team built BackBench, a benchmark of 50 computer-use tasks in Ubuntu environments. In human evaluations, the reward model scaffold significantly outperformed both base agents and rubric-based scaffolds, gaining 120 Elo points over base models and 45 points over rubric-based methods. The gains were most pronounced in resource-constrained scenarios.
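To put the reported Elo gaps in perspective, here is a minimal sketch that converts a rating gap into an expected head-to-head preference rate. It assumes the conventional logistic Elo model with a 400-point scale; the summary does not state which Elo variant the evaluation used, so the function name `elo_expected_score` and the `scale` parameter are illustrative assumptions rather than the authors' method.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo model.

    Assumption: the conventional 400-point logistic scale applies.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gaps quoted in the summary: 45 points (vs. rubric-based scaffolds)
# and 120 points (vs. base models).
for gap in (45, 120):
    rate = elo_expected_score(gap)
    print(f"{gap:>3}-point gap -> expected preference rate {rate:.3f}")
```

Under that assumption, a 120-point gap corresponds to the reward model scaffold being preferred in roughly two out of three pairwise comparisons against a base agent, and a 45-point gap to a little over half of comparisons against a rubric-based scaffold.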

Key takeaway

For research scientists developing autonomous agents, post-execution harm recovery is a necessary complement to pre-execution safeguards. Consider adding a preference-aligned recovery system: a reward model trained on human judgments yielded a 120-point Elo increase over base models, indicating that it improves an agent's ability to mitigate harm in line with user expectations, especially in resource-constrained scenarios. Agents built this way are better equipped to handle the aftermath of failures in a trustworthy and resilient manner.

Key insights

Post-execution harm recovery for AI agents requires human-aligned optimization, prioritizing pragmatic, context-dependent strategies.

Method

A generate-and-verify scaffold first uses a language model (LM) to propose candidate recovery plans. A reward model fine-tuned on the 1,150 human preference judgments then re-ranks the candidates and selects the most preference-aligned plan for execution.
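The generate-then-re-rank step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `select_recovery_plan`, `propose`, and `score` are hypothetical names, and the toy proposer and scorer stand in for the LM and the fine-tuned reward model, neither of which is specified in this summary.

```python
from typing import Callable, Sequence


def select_recovery_plan(
    context: str,
    propose: Callable[[str, int], Sequence[str]],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> str:
    """Generate candidate plans, then return the one the scorer ranks highest.

    `propose` stands in for the LM; `score` stands in for the reward model.
    Both are hypothetical interfaces chosen for this sketch.
    """
    candidates = list(propose(context, num_candidates))
    if not candidates:
        raise ValueError("proposer returned no candidate plans")
    return max(candidates, key=lambda plan: score(context, plan))


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_propose(context: str, n: int) -> Sequence[str]:
    plans = [
        "Restore the deleted file from the trash",
        "Reinstall the operating system and restore every backup",
        "Restore the deleted file from the trash and notify the user",
    ]
    return plans[:n]


def toy_score(context: str, plan: str) -> float:
    # Hand-written heuristic: reward notifying the user, penalize long plans.
    bonus = 1.0 if "notify" in plan else 0.0
    return bonus - 0.01 * len(plan.split())


best = select_recovery_plan("Agent deleted report.docx", toy_propose, toy_score)
print(best)
```

Keeping the proposer and scorer behind plain callables means the same selection logic works whether the scorer is a learned reward model or a rubric-based judge, which mirrors the comparison the summary describes.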

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.