Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

A failure-driven self-improvement loop improves computer-use agents by learning from failed trajectories, a departure from traditional success-only fine-tuning. In this data-centric approach, a large language model (LLM) diagnoses failure modes, proposes inference-time solutions, and generates code patches; the patches are then lightly verified by humans before they are used to upgrade the agent. Applied to the OpenCUA-72B model on the OSWorld benchmark, the method raised the success rate from 42.3% to 48.9%, a gain of 6.6 percentage points. The gain came with no additional training cost and only modest inference overhead, which suggests that failure analysis can efficiently complement existing success-based improvement pipelines for agents built on multimodal large language models (MLLMs).

Key takeaway

AI engineers developing computer-use agents should consider adding a failure-driven self-improvement loop as an efficient way to raise agent performance. Using an LLM to diagnose failed trajectories and generate inference-time code patches can yield measurable success-rate gains, such as the 6.6 percentage point increase reported for OpenCUA-72B on OSWorld, without incurring additional training costs. Light human verification of each proposed patch helps keep the resulting upgrades robust.

Key insights

Failed trajectories offer rich, untapped information for improving computer-use agents without retraining.

Method

An LLM diagnoses the agent's failure modes, proposes inference-time solutions, and generates code patches. After light human verification, the patches are applied to upgrade the agent, turning failures into improvements.
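
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the original authors' implementation: the names `Trajectory`, `Patch`, `Agent`, `improvement_round`, `stub_diagnose`, and `stub_review` are all hypothetical, the LLM call is replaced by a stub function, and human verification is modeled as a simple approve/reject callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    task_id: str
    steps: List[str]      # recorded agent actions (hypothetical format)
    success: bool


@dataclass
class Patch:
    failure_mode: str     # diagnosis of what went wrong
    solution: str         # proposed inference-time fix, in prose
    code: str             # proposed code change, kept as text for review


@dataclass
class Agent:
    # Patches that passed review; no model weights are changed.
    applied_patches: List[Patch] = field(default_factory=list)


def improvement_round(
    agent: Agent,
    trajectories: List[Trajectory],
    diagnose: Callable[[Trajectory], Patch],
    review: Callable[[Patch], bool],
) -> List[Patch]:
    """Run one pass: keep failures only, diagnose each, apply approved patches."""
    failures = [t for t in trajectories if not t.success]
    accepted = []
    for traj in failures:
        patch = diagnose(traj)        # stands in for the LLM diagnosis step
        if review(patch):             # stands in for light human verification
            agent.applied_patches.append(patch)
            accepted.append(patch)
    return accepted


def stub_diagnose(traj: Trajectory) -> Patch:
    """Placeholder for an LLM call; returns a canned diagnosis."""
    return Patch(
        failure_mode=f"stopped after {len(traj.steps)} steps on {traj.task_id}",
        solution="placeholder suggestion",
        code="# placeholder patch text",
    )


def stub_review(patch: Patch) -> bool:
    """Placeholder for a human reviewer; approves any non-empty patch."""
    return bool(patch.code.strip())
```

In a real setup, `diagnose` would wrap a prompt to an LLM and `review` would route the patch to a person. Keeping both as injected callables makes the training-free nature of the loop visible: only the agent's inference-time behavior changes.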

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.