ProCUA-SFT Technical Report

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

ProCUA-SFT is a new dataset comprising 3.1 million step-level supervised fine-tuning (SFT) samples, distilled from 93,000 synthetic trajectories across 2,484 application combinations. The dataset addresses the negative transfer observed with AgentNet: fine-tuning UI-TARS 7B on AgentNet caused OSWorld success rates to drop from 26.3% to 8-10%. ProCUA-SFT is generated by a fully automated pipeline that synthesizes grounded tasks on live desktops using real-world content, including 912 spreadsheets and 10,000 permissively licensed presentations. The pipeline verifies task feasibility through binary precondition checking before rollout. A single VLM, Kimi-K2.5, acts as goal generator, precondition judge, and trajectory executor, which eliminates planner-actor capability gaps. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch achieves 45.0% on OSWorld, an 18.7 percentage-point improvement over the base model and roughly 35 to 37 percentage points above AgentNet-trained models. A subset of ProCUA contributed to the computer-use capabilities of the Nemotron 3 Nano Omni model.
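
The gaps quoted above are differences in percentage points, not relative changes, and the two are easy to confuse. The short Python check below recomputes them from the figures in the summary. The samples-per-trajectory figure at the end is only a rough derived estimate: it assumes every SFT sample corresponds to exactly one step of one trajectory, which the summary does not state, and it uses the rounded 3.1 million count.

```python
# Figures quoted in the summary (OSWorld success rates, in percent).
BASE = 26.3                      # UI-TARS 7B before fine-tuning
PROCUA = 45.0                    # after one epoch on ProCUA-SFT
AGENTNET_RANGE = (8.0, 10.0)     # after fine-tuning on AgentNet

SAMPLES = 3_100_000              # step-level SFT samples (rounded in the summary)
TRAJECTORIES = 93_000            # synthetic trajectories


def pp_gain(new: float, old: float) -> float:
    """Absolute difference in percentage points."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Relative change, as a percent of the old value."""
    return round((new - old) / old * 100, 1)


gain_over_base = pp_gain(PROCUA, BASE)
relative_over_base = relative_gain(PROCUA, BASE)
# Smallest gap uses the best AgentNet result (10%), largest uses the worst (8%).
gain_over_agentnet = tuple(pp_gain(PROCUA, a) for a in reversed(AGENTNET_RANGE))
# Assumption: one sample per trajectory step, so this approximates mean length.
samples_per_trajectory = round(SAMPLES / TRAJECTORIES, 1)

print(gain_over_base, relative_over_base, gain_over_agentnet, samples_per_trajectory)
# prints: 18.7 71.1 (35.0, 37.0) 33.3
```

In relative terms the jump from 26.3% to 45.0% is about 71%, which is why the 18.7 figure should be read as percentage points.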

Key takeaway

Machine learning engineers developing computer-use agents should prioritize high-quality, synthetically generated SFT data over large, potentially noisy human-collected datasets. If existing training data causes negative transfer, consider an automated synthesis pipeline with precondition checks and a single VLM in every role. In the report, this approach raises UI-TARS 7B by 18.7 percentage points on OSWorld.

Key insights

Synthetically generated, verified, and context-rich SFT data significantly improves computer-use agent performance.

Principles

Automated data synthesis can overcome negative transfer.
Precondition checking ensures task feasibility.
Unified VLM roles eliminate planner-actor gaps.

Method

A fully automated pipeline synthesizes grounded tasks on live desktops, seeds them with real-world content, and verifies feasibility via binary precondition checks before trajectory rollout.
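
The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the report's implementation: the `VLM` interface, its method names, the `Trajectory` type, and the `StubVLM` stand-in are all hypothetical, and a real pipeline would pass screenshots and application state rather than a plain string. The point it shows is that one model object fills all three roles and that the binary precondition check gates the rollout.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


class VLM(Protocol):
    """Hypothetical interface: one model serving all three roles."""

    def propose_goal(self, desktop_state: str) -> str: ...
    def precondition_holds(self, goal: str, desktop_state: str) -> bool: ...
    def rollout(self, goal: str, desktop_state: str) -> List[str]: ...


@dataclass
class Trajectory:
    goal: str
    actions: List[str]


def synthesize(model: VLM, desktop_state: str) -> Optional[Trajectory]:
    """Generate a goal, gate it on a binary precondition check, then roll out."""
    goal = model.propose_goal(desktop_state)
    if not model.precondition_holds(goal, desktop_state):
        return None  # infeasible task: skip the rollout entirely
    return Trajectory(goal, model.rollout(goal, desktop_state))


class StubVLM:
    """Toy stand-in so the sketch runs without a real model or desktop."""

    def propose_goal(self, desktop_state: str) -> str:
        return "Sum column B in budget.xlsx"

    def precondition_holds(self, goal: str, desktop_state: str) -> bool:
        # Binary check: the file the goal refers to must actually be present.
        return "budget.xlsx" in desktop_state

    def rollout(self, goal: str, desktop_state: str) -> List[str]:
        return ["click(cell='B11')", "type('=SUM(B2:B10)')", "press('Enter')"]


kept = synthesize(StubVLM(), "open files: budget.xlsx")
skipped = synthesize(StubVLM(), "open files: notes.txt")
print(kept is not None, skipped is None)
```

Because the same model proposes, judges, and executes, a goal it generates is one it is plausibly able to carry out, which is the planner-actor gap the summary refers to.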

In practice

Generate SFT data using a single VLM for multiple roles.
Incorporate precondition checks in synthetic data generation.
Expand trajectories into step-prefix samples for inference context.
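
The last item, step-prefix expansion, can be illustrated as follows. This is a hedged sketch: the `Step` and `Sample` types and the `expand` function are hypothetical names, and it assumes each sample keeps the full prefix of earlier steps as context, whereas the report may truncate or summarize history. Each step of a trajectory becomes one training sample whose input is the goal, the preceding steps, and the current observation, and whose target is the action taken at that step.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    observation: str  # e.g. a reference to the screenshot seen at this step
    action: str       # the action the executor took


@dataclass(frozen=True)
class Sample:
    goal: str
    history: Tuple[Step, ...]  # all steps before the current one
    observation: str           # what the agent sees now
    target_action: str         # what it should predict


def expand(goal: str, steps: Sequence[Step]) -> List[Sample]:
    """Turn one n-step trajectory into n step-level samples."""
    return [
        Sample(goal, tuple(steps[:i]), step.observation, step.action)
        for i, step in enumerate(steps)
    ]


trajectory = [
    Step("shot_0.png", "click(cell='B11')"),
    Step("shot_1.png", "type('=SUM(B2:B10)')"),
    Step("shot_2.png", "press('Enter')"),
]
samples = expand("Sum column B in budget.xlsx", trajectory)
print(len(samples), len(samples[2].history), samples[2].target_action)
```

Training on prefixes of this shape means each sample resembles what the agent sees at inference time, where it also has only the goal, its past steps, and the current screen.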

Topics

Computer-use Agents
Supervised Fine-tuning
Data Synthesis
Vision-Language Models
Desktop Automation
UI-TARS

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.