Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents
Summary
A failure-driven self-improvement loop improves computer-use agents by learning from failed trajectories, rather than fine-tuning only on successful ones. In this data-centric approach, a large language model (LLM) diagnoses failure modes, proposes inference-time solutions, and generates code patches; the patches are lightly human-verified and then used to upgrade the agent. Applied to the OpenCUA-72B model on the OSWorld benchmark, the method raised the success rate from 42.3% to 48.9%, a gain of 6.6 percentage points. The gain required no additional training cost and only modest inference overhead, which suggests that failure analysis can efficiently complement existing success-based improvement pipelines for agents built on multimodal large language models (MLLMs).
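The reported gain is stated in percentage points, which is not the same as a relative improvement. The short calculation below uses only the two success rates quoted above; the variable names are illustrative.

```python
# Success rates quoted in the summary (percent of OSWorld tasks solved).
baseline = 42.3
patched = 48.9

# Absolute gain, in percentage points.
absolute_gain = patched - baseline

# Relative gain, as a percentage of the baseline success rate.
relative_gain = absolute_gain / baseline * 100

print(f"{absolute_gain:.1f} pp, {relative_gain:.1f}% relative")
# prints "6.6 pp, 15.6% relative"
```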
Key takeaway
AI engineers developing computer-use agents should consider adding a failure-driven self-improvement loop to their pipeline. Using an LLM to diagnose failed trajectories and generate inference-time code patches can raise the success rate, as in the 6.6 percentage point increase reported for OpenCUA-72B, without incurring additional training cost. Light human verification of each proposed patch helps keep agent upgrades robust.
Key insights
Failed trajectories offer rich, untapped information for improving computer-use agents without retraining.
Principles
- Failure diagnosis reveals model weaknesses.
- Inference-time patches upgrade agents.
- Human verification ensures patch quality.
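To make "inference-time patch" concrete, here is a minimal sketch of one possible shape such a patch could take: a wrapper around the agent's action function that changes behavior without touching model weights. The failure mode (repeating the same action indefinitely), the `with_repeat_guard` name, the string action format, and the `"WAIT"` fallback are all assumptions made for illustration; none of them is taken from the original article.

```python
from typing import Callable, List

Action = str
Policy = Callable[[str], Action]  # hypothetical interface: observation -> action


def with_repeat_guard(policy: Policy, max_repeats: int = 3,
                      fallback: Action = "WAIT") -> Policy:
    """Wrap a policy so it cannot emit the same action more than max_repeats times in a row.

    This is an illustrative inference-time patch: the underlying model is
    unchanged, only the code around it is.
    """
    history: List[Action] = []

    def patched(observation: str) -> Action:
        action = policy(observation)
        recent = history[-max_repeats:]
        # If the last max_repeats emitted actions all match the new one,
        # substitute the fallback action.
        if len(recent) == max_repeats and all(a == action for a in recent):
            action = fallback
        history.append(action)
        return action

    return patched


# Usage with a stand-in policy that always clicks the same spot.
def stuck_policy(observation: str) -> Action:
    return "CLICK(10, 10)"


agent = with_repeat_guard(stuck_policy)
actions = [agent("screenshot") for _ in range(5)]
```

Because the patch is a plain wrapper, it can be reviewed, enabled, or reverted independently of the model, which is what makes light human verification practical.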
Method
An LLM diagnoses the agent's failure modes, proposes inference-time solutions, and generates code patches. After light human verification, the patches are applied to upgrade the agent, turning failures into improvements.
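The loop described here can be sketched roughly as follows. This is a simplified outline under stated assumptions, not the authors' implementation: the `LLM` callable (prompt in, text out), the `Trajectory` and `Patch` types, the prompt wording, and the `reviewer` callback standing in for human verification are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns text


@dataclass
class Trajectory:
    task_id: str
    steps: List[str]
    succeeded: bool


@dataclass
class Patch:
    failure_mode: str
    code: str  # proposed inference-time change, kept as text for review


def diagnose(llm: LLM, traj: Trajectory) -> str:
    """Ask the LLM to name the failure mode of one failed trajectory."""
    log = "\n".join(traj.steps)
    return llm(f"DIAGNOSE the failure in task {traj.task_id}:\n{log}").strip()


def propose_patch(llm: LLM, failure_mode: str) -> Patch:
    """Ask the LLM for an inference-time code patch addressing a failure mode."""
    code = llm(f"PATCH: write an inference-time fix for: {failure_mode}").strip()
    return Patch(failure_mode=failure_mode, code=code)


def improvement_round(llm: LLM, trajectories: List[Trajectory],
                      reviewer: Callable[[Patch], bool]) -> List[Patch]:
    """Run one pass: diagnose failures, propose patches, keep the approved ones."""
    approved: List[Patch] = []
    seen = set()
    for traj in trajectories:
        if traj.succeeded:
            continue  # only failed trajectories are analyzed
        mode = diagnose(llm, traj)
        if mode in seen:
            continue  # one patch per distinct failure mode
        seen.add(mode)
        patch = propose_patch(llm, mode)
        if reviewer(patch):  # stand-in for light human verification
            approved.append(patch)
    return approved
```

In a real setting `reviewer` would surface each patch to a person rather than return a value automatically, and the approved patches would then be applied to the agent's inference code before it is re-evaluated.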
In practice
- Apply LLMs for failure diagnosis.
- Generate code patches for agent fixes.
- Integrate human verification for patches.
Topics
- Computer-Use Agents
- Multimodal LLMs
- Failure Analysis
- Self-Improvement Loops
- Inference-Time Patching
- OSWorld Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.