Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

A new failure-driven self-improvement loop improves computer-use agents by learning from failed trajectories, a departure from traditional success-only fine-tuning. In this data-centric paradigm, a large language model (LLM) diagnoses failure modes, proposes inference-time solutions, and generates code patches, which are lightly human-verified before being used to upgrade the agent. Applied to the OpenCUA-72B model on the OSWorld benchmark, the method raised the success rate from 42.3% to 48.9%, a gain of 6.6 percentage points. The gain came with no additional training cost and only modest inference overhead, which suggests that failure analysis can efficiently complement existing success-based improvement pipelines for agents built on multimodal large language models (MLLMs).

Key takeaway

AI engineers building computer-use agents should consider adding a failure-driven self-improvement loop to raise agent performance efficiently. Using an LLM to diagnose failed trajectories and generate inference-time code patches can improve success rates, as in the 6.6 percentage point gain reported for OpenCUA-72B on OSWorld, without incurring additional training cost. Keep a light human verification step so that proposed patches are reviewed before they are applied to the agent.

Key insights

Failed trajectories offer rich, untapped information for improving computer-use agents without retraining.

Principles

Failure diagnosis reveals model weaknesses.
Inference-time patches upgrade agents.
Human verification ensures patch quality.

Method

An LLM diagnoses the agent's failure modes, proposes inference-time solutions, and generates code patches. After light human verification, the patches are applied to upgrade the agent, turning failures into improvements.
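
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Trajectory`, `Patch`, and `improvement_round`, and the four callables (`diagnose`, `propose`, `write_patch`, `review`), are all hypothetical stand-ins for the LLM calls and the human check, which the summary does not specify in detail.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One agent run (hypothetical record format)."""
    task_id: str
    steps: List[str]
    success: bool


@dataclass
class Patch:
    """A proposed inference-time fix for one diagnosed failure mode."""
    failure_mode: str
    proposal: str
    code: str
    approved: bool = False


def improvement_round(
    trajectories: List[Trajectory],
    diagnose: Callable[[List[Trajectory]], List[str]],
    propose: Callable[[str], str],
    write_patch: Callable[[str, str], str],
    review: Callable[[Patch], bool],
) -> List[Patch]:
    """Run one round: failed trajectories in, human-approved patches out."""
    # Only failures feed the loop; successes are ignored here.
    failed = [t for t in trajectories if not t.success]
    accepted: List[Patch] = []
    for mode in diagnose(failed):          # assumed LLM call: failure modes
        proposal = propose(mode)           # assumed LLM call: proposed fix
        patch = Patch(mode, proposal, write_patch(mode, proposal))
        patch.approved = review(patch)     # assumed human verification step
        if patch.approved:
            accepted.append(patch)
    return accepted


def demo() -> List[Patch]:
    """Exercise the loop with stub callables in place of an LLM and a reviewer."""
    trajectories = [
        Trajectory("t1", ["click(10, 20)"], True),
        Trajectory("t2", ["click(999, 999)"], False),
    ]
    return improvement_round(
        trajectories,
        diagnose=lambda failed: ["click outside visible area"] if failed else [],
        propose=lambda mode: f"validate coordinates before acting ({mode})",
        write_patch=lambda mode, proposal: "# hypothetical patch body",
        review=lambda patch: True,
    )


if __name__ == "__main__":
    for p in demo():
        print(p.failure_mode, "->", p.proposal)
```

Passing the LLM and reviewer in as callables keeps the sketch independent of any particular model API; in a real system each stub would be replaced by an actual model call or review interface.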

In practice

Apply LLMs for failure diagnosis.
Generate code patches for agent fixes.
Integrate human verification for patches.
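
One way to picture the last two points is a patch registry in which only reviewer-approved patches are applied at inference time. The sketch below is an assumption-laden illustration: the summary does not say what the generated patches contain, so the action format, the `InferencePatch` class, the `clamp_click` example patch, and the screen size are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical action format: pixel coordinates of a click.
Action = Dict[str, int]


@dataclass
class InferencePatch:
    name: str
    apply: Callable[[Action], Action]
    approved: bool = False  # set to True only after a human reviewer signs off


def clamp_click(action: Action, width: int = 1920, height: int = 1080) -> Action:
    """Illustrative patch: keep click coordinates inside an assumed screen size."""
    return {
        **action,
        "x": min(max(action["x"], 0), width - 1),
        "y": min(max(action["y"], 0), height - 1),
    }


def run_patches(action: Action, patches: List[InferencePatch]) -> Action:
    """Apply approved patches in order; unapproved patches are skipped."""
    for patch in patches:
        if patch.approved:
            action = patch.apply(action)
    return action


if __name__ == "__main__":
    registry = [InferencePatch("clamp_click", clamp_click, approved=True)]
    print(run_patches({"x": 5000, "y": -3}, registry))
```

Because the patches wrap the agent's outputs rather than change its weights, they can be added or withdrawn without retraining, which matches the no-training-cost framing of the summary.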

Topics

Computer-Use Agents
Multimodal LLMs
Failure Analysis
Self-Improvement Loops
Inference-Time Patching
OSWorld Benchmark

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.