DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

DeskCraft is a desktop GUI benchmark for evaluating AI agents on complex, long-horizon professional workflows that require human-in-the-loop collaboration. Unlike existing benchmarks, which simplify tasks and provide all instructions upfront, DeskCraft uses a multilevel difficulty taxonomy that includes tasks with more than 50 execution steps in professional creative software for design, video, audio, and 3D creation. It formalizes human-agent interaction through a protocol that covers mid-turn exchanges (agent-initiated clarification or user interruption) and post-turn feedback after task completion. In an evaluation of 18 agents on 538 tasks, GPT-5.4 scored 31.6% on standard tasks and 27.6% on interactive tasks. The analyses point to persistent failures in long-horizon workflow delivery and proactive clarification, both significant areas for agent improvement. The evaluation code, tasks, and data will be open-sourced.

Key takeaway

AI engineers building desktop agents for professional creative or engineering software should recognize that current models, including GPT-5.4, significantly underperform on long-horizon tasks and human-in-the-loop collaboration. Prioritize robust proactive clarification mechanisms and multi-step workflow delivery to meet real-world demands, and use benchmarks like DeskCraft to test and validate improvements in these areas.

Key insights

DeskCraft benchmarks desktop agents on complex, human-collaborative professional workflows, revealing current AI limitations in long-horizon task execution.

Principles

Real-world workflows demand human-agent collaboration.
Long-horizon tasks reveal agent limitations.
Benchmarks must reflect professional software complexity.

Method

DeskCraft formalizes human-agent collaboration via mid-turn (clarification, interruption) and post-turn (feedback) interaction protocols. It uses a multilevel difficulty taxonomy for long-horizon creative and engineering tasks.
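The interaction protocol described above can be pictured as a small data model. The following is a minimal sketch, not DeskCraft's actual schema: the names `Phase`, `Kind`, `Exchange`, and `is_well_ordered` are hypothetical, and treating post-turn feedback as user-initiated is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    MID_TURN = "mid_turn"    # while the agent is still working on the task
    POST_TURN = "post_turn"  # after the agent reports the task as complete


class Kind(Enum):
    CLARIFICATION = "clarification"  # agent asks the user a question
    INTERRUPTION = "interruption"    # user redirects the agent mid-task
    FEEDBACK = "feedback"            # comments given after completion


# Phase of each kind of exchange, following the summary's description.
PHASE_OF = {
    Kind.CLARIFICATION: Phase.MID_TURN,
    Kind.INTERRUPTION: Phase.MID_TURN,
    Kind.FEEDBACK: Phase.POST_TURN,
}

# Who starts each exchange. The "user" entry for FEEDBACK is an assumption.
INITIATOR_OF = {
    Kind.CLARIFICATION: "agent",
    Kind.INTERRUPTION: "user",
    Kind.FEEDBACK: "user",
}


@dataclass(frozen=True)
class Exchange:
    kind: Kind
    step: int      # agent step index at which the exchange occurs
    message: str

    @property
    def phase(self) -> Phase:
        return PHASE_OF[self.kind]

    @property
    def initiator(self) -> str:
        return INITIATOR_OF[self.kind]


def is_well_ordered(exchanges: list[Exchange]) -> bool:
    """Return True if no mid-turn exchange appears after a post-turn one."""
    seen_post_turn = False
    for ex in exchanges:
        if ex.phase is Phase.POST_TURN:
            seen_post_turn = True
        elif seen_post_turn:
            return False
    return True
```

A log such as `[Exchange(Kind.CLARIFICATION, 3, "Which layer?"), Exchange(Kind.FEEDBACK, 50, "Brighter, please")]` passes the ordering check, while the same log reversed does not.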

In practice

Use DeskCraft to evaluate desktop agent performance.
Focus agent development on proactive clarification.
Improve agents for multi-step creative software tasks.
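When acting on the first item above, it helps to break results down by task mode and difficulty level instead of reporting a single number. The sketch below assumes a hypothetical record format (`mode`, `level`, `success`); it is not DeskCraft's output format, and the toy records are made up purely to show the function's shape.

```python
from collections import defaultdict


def success_rates(results):
    """Map each (mode, level) pair to the fraction of its tasks that passed."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        key = (r["mode"], r["level"])
        total[key] += 1
        passed[key] += int(r["success"])
    return {key: passed[key] / total[key] for key in total}


# Made-up toy records for illustration only; these are not DeskCraft results.
toy_results = [
    {"mode": "standard", "level": 1, "success": True},
    {"mode": "standard", "level": 1, "success": False},
    {"mode": "interactive", "level": 3, "success": False},
]

rates = success_rates(toy_results)
for (mode, level), rate in sorted(rates.items()):
    print(f"{mode:12s} level {level}: {rate:.1%}")
```

Grouping this way makes it easier to see whether failures concentrate in the long-horizon levels or in the interactive setting, which are the two weak spots the summary highlights.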

Topics

Desktop Agents
GUI Benchmarking
Human-in-the-Loop
Professional Workflows
Creative Software
GPT-5.4

Code references

mrwwk/DeskCraft

Best for: Research Scientist, AI Product Manager, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.