ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering) · Depth: Expert, extended

Summary

ComAct introduces a novel paradigm for professional software manipulation, reframing interaction as deterministic program synthesis via the Component Object Model (COM). This approach addresses critical limitations of existing GUI-based agents, which suffer from fragile visual grounding and error accumulation, and API-based methods, constrained by heterogeneous protocols and inaccessible commercial interfaces. To validate ComAct, the researchers developed ComCADBench, the first benchmark for agents operating real industrial CAD software, including SolidWorks, Inventor, and AutoCAD, across 1,000 tasks. They also created ComActor, a self-correcting agent trained through a progressive three-stage framework, and ComForge, a scalable platform utilizing Dockerized Windows environments for large-scale training. ComActor achieved superior performance on ComCADBench, demonstrating strong resilience in long-horizon tasks and outperforming frontier proprietary models like GPT-5 and Claude-Sonnet-4.6. It also generalized effectively to external CAD benchmarks such as Text2CAD and CADPrompt.

Key takeaway

AI engineers building agents for complex professional software such as CAD should prioritize programmatic interfaces over GUI-based approaches. The ComAct paradigm, which uses COM for deterministic program synthesis, is reported to be more reliable and more broadly applicable, especially on long-horizon tasks. Consider a multi-stage training framework that includes geometric reward optimization, so that agents achieve both syntactic correctness and task-level fidelity rather than depending on fragile visual grounding.

Key insights

COM-as-Action reframes professional software manipulation as deterministic program synthesis, overcoming GUI fragility and API limitations.
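To make the idea concrete, here is a minimal sketch of what treating a generated program as the agent's action could look like. It is an illustration under stated assumptions, not the paper's implementation: `ActionResult`, `run_action`, `connect`, and the `FakeApp` stand-in are hypothetical names, and the summary does not specify ComAct's actual action format. The `connect` helper uses the real `win32com.client.Dispatch` call from pywin32, which only works on Windows with the target application installed, so the demo runs against an in-memory stand-in whose `AddCircle` method loosely mirrors the AutoCAD ActiveX method of the same name.

```python
from dataclasses import dataclass


@dataclass
class ActionResult:
    ok: bool
    feedback: str  # empty on success, "<ErrorType>: <message>" on failure


def connect(prog_id: str):
    """Attach to a COM automation server (Windows with pywin32 only)."""
    import win32com.client  # imported lazily so the rest runs on any OS
    return win32com.client.Dispatch(prog_id)


def run_action(program: str, app) -> ActionResult:
    """Execute one generated program against the application object.

    The whole program is a single action: it either completes or
    returns a textual error that can be handed back to the model.
    """
    try:
        exec(program, {"app": app})
        return ActionResult(True, "")
    except Exception as exc:
        return ActionResult(False, f"{type(exc).__name__}: {exc}")


# Hypothetical in-memory stand-in for a CAD application object.
class FakeModelSpace:
    def __init__(self):
        self.entities = []

    def AddCircle(self, center, radius):
        if radius <= 0:
            raise ValueError("radius must be positive")
        self.entities.append(("circle", tuple(center), radius))


class FakeApp:
    def __init__(self):
        self.ModelSpace = FakeModelSpace()


if __name__ == "__main__":
    app = FakeApp()  # on Windows: app = connect("AutoCAD.Application")
    print(run_action("app.ModelSpace.AddCircle((0, 0, 0), 5.0)", app))
    print(run_action("app.ModelSpace.AddCircle((0, 0, 0), -1.0)", app))
```

The point of the sketch is the contrast with GUI agents: there is no screenshot to ground against, and a failure surfaces as an explicit error string rather than a silently misplaced click. Running model-generated code through `exec` is only reasonable inside an isolated sandbox.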

Method

ComActor is trained via a three-stage framework: instruction-to-code supervised fine-tuning (SFT), agentic refinement with multimodal feedback, and task-level Group Relative Policy Optimization (GRPO) with a continuous geometric reward. All three stages run within ComForge's parallelized Windows environments.
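The third stage can be illustrated with a small sketch of how a continuous geometric reward feeds a group-relative advantage. Two assumptions are worth flagging: the summary does not say which geometric reward the authors use, so `voxel_iou` (intersection over union of voxel occupancy sets) is a stand-in chosen for illustration, and `group_relative_advantages` follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation) rather than any ComActor-specific variant. Population standard deviation is used here; implementations differ on that detail.

```python
from statistics import mean, pstdev


def voxel_iou(pred: set, target: set) -> float:
    """Continuous reward in [0, 1]: overlap of two voxel occupancy sets."""
    if not pred and not target:
        return 1.0  # two empty shapes are treated as identical
    return len(pred & target) / len(pred | target)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout scores the same
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Target shape: a 2x2 slab of voxels (illustrative data only).
    target = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)}
    rollouts = [
        {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)},  # exact match
        {(0, 0, 0), (1, 0, 0)},                        # half the slab
        set(),                                         # nothing built
    ]
    rewards = [voxel_iou(p, target) for p in rollouts]
    print(rewards)
    print(group_relative_advantages(rewards))
```

A continuous reward matters here because a binary pass/fail signal would give every partially correct model in a group the same score, leaving the group-relative normalization with nothing to rank.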

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.