DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

DeskCraft is a new desktop GUI benchmark for evaluating AI agents on complex, long-horizon professional workflows that require human-in-the-loop collaboration. Unlike existing benchmarks, which simplify tasks and provide all instructions upfront, DeskCraft uses a multilevel difficulty taxonomy that includes tasks with more than 50 execution steps in professional creative software for design, video, audio, and 3D creation. It formalizes human-agent interaction through a protocol with two parts: mid-turn exchanges, covering agent-initiated clarification and user interruption, and post-turn feedback after task completion. In an evaluation of 18 agents on 538 tasks, GPT-5.4 achieved 31.6% on standard tasks and 27.6% on interactive tasks. The analyses point to persistent failures in long-horizon workflow delivery and in proactive clarification, both of which leave substantial room for improvement. The evaluation code, tasks, and data will be open-sourced.

Key takeaway

AI engineers developing desktop agents for professional creative or engineering software should recognize that current models, including GPT-5.4, significantly underperform on long-horizon tasks and human-in-the-loop collaboration. Development effort should prioritize robust proactive clarification and reliable multi-step workflow delivery to meet real-world demands. Use benchmarks like DeskCraft to test and validate improvements in these areas.

Key insights

DeskCraft benchmarks desktop agents on complex, human-collaborative professional workflows, revealing current AI limitations in long-horizon task execution.

Method

DeskCraft formalizes human-agent collaboration via mid-turn (clarification, interruption) and post-turn (feedback) interaction protocols. It uses a multilevel difficulty taxonomy for long-horizon creative and engineering tasks.
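The interaction protocol described above can be pictured as a small data model. The following is a minimal illustrative sketch, not DeskCraft's actual schema: the names `Phase`, `Kind`, `Interaction`, and `Episode`, the `"agent"`/`"user"` initiator labels, and the sample messages are all assumptions introduced here. It only encodes what the summary states, namely that clarification and interruption occur mid-turn and feedback occurs post-turn.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Phase(Enum):
    MID_TURN = "mid_turn"    # exchange while the agent is still working
    POST_TURN = "post_turn"  # exchange after the agent reports completion


class Kind(Enum):
    CLARIFICATION = "clarification"  # agent-initiated question
    INTERRUPTION = "interruption"    # user-initiated change of course
    FEEDBACK = "feedback"            # user review after task completion


# Which kinds are valid in which phase, following the summary's description.
ALLOWED = {
    Phase.MID_TURN: {Kind.CLARIFICATION, Kind.INTERRUPTION},
    Phase.POST_TURN: {Kind.FEEDBACK},
}


@dataclass(frozen=True)
class Interaction:
    phase: Phase
    kind: Kind
    initiator: str  # "agent" or "user" (hypothetical labels)
    message: str

    def __post_init__(self):
        # Reject combinations the protocol does not describe.
        if self.kind not in ALLOWED[self.phase]:
            raise ValueError(
                f"{self.kind.value} is not a {self.phase.value} interaction"
            )


@dataclass
class Episode:
    task_id: str  # hypothetical identifier format
    interactions: List[Interaction] = field(default_factory=list)

    def record(self, interaction: Interaction) -> None:
        self.interactions.append(interaction)

    def count(self, kind: Kind) -> int:
        # Number of recorded interactions of the given kind.
        return sum(1 for i in self.interactions if i.kind is kind)


# Example usage with made-up messages.
episode = Episode(task_id="demo-task")
episode.record(
    Interaction(Phase.MID_TURN, Kind.CLARIFICATION, "agent", "Which export format?")
)
episode.record(
    Interaction(Phase.POST_TURN, Kind.FEEDBACK, "user", "Trim the intro slightly.")
)
```

Validating the phase and kind pairing at construction time keeps malformed logs out of later analysis, for example when counting how often an agent asked for clarification before proceeding.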

Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.