Human-Guided Harm Recovery for Computer Use Agents
Summary
Researchers from MIT CSAIL, Abridge, and humans& have introduced a new paradigm for agent safety called "harm recovery," which focuses on remediating harmful actions by computer use agents after they occur rather than solely preventing them. They formalize harm recovery as an optimization problem aligned with human preferences, and they ran a formative user study to identify key recovery dimensions and distill them into a natural language rubric. A dataset of 1,150 pairwise judgments revealed that the importance of each attribute shifts with context, with users often preferring pragmatic, targeted strategies over comprehensive long-term approaches. These insights were operationalized into a reward model that re-ranks candidate recovery plans generated by an agent scaffold. To evaluate recovery capabilities systematically, the team developed BackBench, a benchmark of 50 computer-use tasks in Ubuntu environments. In human evaluations, the reward model scaffold significantly outperformed both base agents and rubric-based scaffolds, gaining 120 Elo points over base models and 45 Elo points over rubric-based methods. The approach was particularly effective in resource-constrained scenarios.
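To give the Elo gaps above a rough interpretation, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the conventional Elo logistic model with a scale of 400, which this summary does not confirm the paper uses; the function name `elo_expected_score` is introduced here for illustration only.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model.

    Assumption: the conventional base-10 logistic with a 400-point scale.
    The summary does not state which Elo variant the paper reports.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# Gaps quoted in the summary: +120 over base models, +45 over rubric-based methods.
print(f"{elo_expected_score(120):.3f}")  # 0.666
print(f"{elo_expected_score(45):.3f}")   # 0.564
```

Under that assumption, a 120-point gap corresponds to being preferred in roughly two out of three pairwise comparisons, and a 45-point gap to a little over 56 percent.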
Key takeaway
For research scientists developing autonomous agents, the main lesson is to pair pre-execution safeguards with post-execution, preference-aligned harm recovery. Training a reward model on human judgments yielded a 120-point Elo increase over base models, a sign that it significantly improves an agent's ability to mitigate harm in line with user expectations, especially in resource-constrained scenarios. This approach helps agents handle the aftermath of failures with greater trustworthiness and resilience.
Key insights
Post-execution harm recovery for AI agents requires human-aligned optimization, prioritizing pragmatic, context-dependent strategies.
Principles
- Prevention alone is insufficient for agent safety.
- Optimal recovery balances practical constraints with human values.
- Attribute importance in recovery varies by context.
Method
A generate-and-verify scaffold uses an LM to propose recovery plans, then a reward model, fine-tuned on 1,150 human preference judgments, re-ranks and selects the most aligned plan for execution.
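The generate-and-verify loop described above can be sketched as a simple best-of-N selection. This is a minimal illustration rather than the authors' implementation: `select_recovery_plan`, `toy_propose`, and `toy_score` are hypothetical names, and the toy scorer (which just prefers shorter plans) stands in for a reward model trained on human preferences.

```python
from typing import Callable, List


def select_recovery_plan(
    context: str,
    propose: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> str:
    """Generate candidate plans, score each one, and return the best.

    `propose` stands in for the language model that drafts recovery plans;
    `score` stands in for the preference-trained reward model.
    """
    candidates = propose(context, num_candidates)
    if not candidates:
        raise ValueError("the proposer returned no candidate plans")
    # Re-rank: keep the plan with the highest reward-model score.
    return max(candidates, key=lambda plan: score(context, plan))


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_propose(context: str, n: int) -> List[str]:
    plans = [
        "Restore the deleted file from the trash.",
        "Reinstall the operating system and restore all data from backup.",
        "Notify the user and wait.",
    ]
    return plans[:n]


def toy_score(context: str, plan: str) -> float:
    # Hypothetical scorer: shorter plans score higher.
    return -float(len(plan))


best = select_recovery_plan("Agent deleted report.txt", toy_propose, toy_score)
print(best)
```

Keeping the proposer and the scorer as separate callables mirrors the split in the text: the same candidate generator can be paired with either a rubric-based scorer or a learned reward model.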
In practice
- Use BackBench to evaluate agent recovery capabilities.
- Train reward models on human preferences for context-aware recovery.
- Favor fast, targeted fixes over comprehensive long-term plans when harm remediation is time-critical.
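For the item on training reward models from human preferences, one common way to learn from pairwise judgments is a Bradley-Terry style loss, sketched below. The summary does not state which training objective the paper uses, so treat this as an assumption; `pairwise_preference_loss` is a hypothetical helper name.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss for one human pairwise judgment.

    Computes -log(sigmoid(score_preferred - score_rejected)) in a
    numerically stable form. Assumption: the paper's actual objective
    is not given in this summary.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)), rewritten to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the model scores the human-preferred plan higher.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 2.0))  # True
```

Averaging this loss over all judged pairs and minimizing it pushes the model to score the plan people preferred above the one they rejected.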
Topics
- Harm Recovery
- Computer Use Agents
- Human Preference Alignment
- Reward Models
- BackBench Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.