ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

2025-03-25 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

ComAct introduces a paradigm for professional software manipulation that reframes interaction as deterministic program synthesis via the Component Object Model (COM). The approach targets two limitations of existing methods: GUI-based agents suffer from fragile visual grounding and error accumulation, while API-based methods are constrained by heterogeneous protocols and inaccessible commercial interfaces. To validate ComAct, the researchers developed ComCADBench, the first benchmark for agents operating real industrial CAD software (SolidWorks, Inventor, and AutoCAD), spanning 1,000 tasks. They also created ComActor, a self-correcting agent trained through a progressive three-stage framework, and ComForge, a scalable platform that uses Dockerized Windows environments for large-scale training. ComActor achieved superior performance on ComCADBench, remained resilient on long-horizon tasks, and outperformed frontier proprietary models such as GPT-5 and Claude-Sonnet-4.6. It also generalized to external CAD benchmarks such as Text2CAD and CADPrompt.

Key takeaway

AI engineers building agents for complex professional software such as CAD should prioritize programmatic interfaces over GUI-based approaches. The ComAct paradigm, which uses COM for deterministic program synthesis, offers greater reliability and universality, especially on long-horizon tasks. Consider a multi-stage training framework that includes geometric reward optimization, so that agents achieve both syntactic correctness and task-level fidelity while avoiding the pitfalls of fragile visual grounding.

Key insights

COM-as-Action reframes professional software manipulation as deterministic program synthesis, overcoming GUI fragility and API limitations.

Principles

COM offers a unified, semantic programmatic interface.
Code-driven execution prevents cascading errors in long tasks.
Geometric reward optimization bridges the syntax-geometry gap.
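
To make the syntax-geometry gap concrete: a script can run without error and still build the wrong shape, so a reward needs to score the resulting geometry rather than the code. The summary does not say which metric ComAct uses. The sketch below assumes one common choice, intersection-over-union on a coarse voxel grid, and the names `box_voxels` and `geometric_reward` are hypothetical.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def box_voxels(origin: Voxel, size: Voxel) -> Set[Voxel]:
    """Voxelize an axis-aligned box (illustrative stand-in for a real CAD solid)."""
    ox, oy, oz = origin
    sx, sy, sz = size
    return {
        (x, y, z)
        for x in range(ox, ox + sx)
        for y in range(oy, oy + sy)
        for z in range(oz, oz + sz)
    }


def geometric_reward(predicted: Set[Voxel], target: Set[Voxel]) -> float:
    """Continuous reward in [0, 1]: voxel IoU (an assumed metric, not the paper's)."""
    union = predicted | target
    if not union:
        # Two empty solids are treated as a perfect match.
        return 1.0
    return len(predicted & target) / len(union)


# A script that runs cleanly but extrudes only half the required height
# still earns partial credit instead of a binary pass/fail.
target = box_voxels((0, 0, 0), (4, 4, 4))
half_height = box_voxels((0, 0, 0), (4, 4, 2))
reward = geometric_reward(half_height, target)
```

A graded score like this separates "almost right" from "completely wrong" among programs that are all syntactically valid, which a pass/fail execution check cannot do.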

Method

ComActor is trained via a three-stage framework: instruction-to-code SFT, agentic refinement with multimodal feedback, and task-level GRPO with continuous geometric reward, all within ComForge's parallelized Windows environments.
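
The third stage can be illustrated with the group-relative advantage step that gives GRPO its name: several candidate programs are sampled for the same task, each is scored, and every reward is normalized against its own group. This is a minimal sketch of that single step only, not ComActor's training code. The function name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize each sampled program's reward against its own group.

    rewards: one continuous reward per candidate program for the same task
             (for example, a geometric similarity score in [0, 1]).
    eps:     small constant (assumed) to avoid division by zero when all
             candidates score the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three candidates for one task: wrong geometry, partial match, exact match.
advantages = group_relative_advantages([0.0, 0.5, 1.0])
```

Candidates above the group mean receive positive advantages and are reinforced; those below are penalized. When every candidate scores the same, all advantages are zero and the group contributes no learning signal, which is one reason a continuous reward helps more than a binary one.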

In practice

Implement COM for robust industrial software automation.
Employ multi-stage training for self-correcting agents.
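
Both practices can be sketched together as a generate-execute-retry loop: the agent emits a script, the script runs against an application object, and any error text is fed back for the next attempt. This is a simplified illustration under stated assumptions, not ComActor's implementation. It uses text-only error feedback, whereas the paper describes multimodal feedback. `FakeApp` and `fake_generate` are hypothetical stand-ins for a COM application object and a model call.

```python
import traceback
from typing import Callable, Optional, Tuple


def execute_script(script: str, app: object) -> Optional[str]:
    """Run a generated script with `app` in scope.

    Returns None on success, or the error text to use as feedback.
    """
    try:
        exec(script, {"app": app})
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def run_with_self_correction(
    generate: Callable[[Optional[str]], str],
    app: object,
    max_rounds: int = 3,
) -> Tuple[bool, int]:
    """Ask `generate` for a script, run it, and retry with the error as feedback."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        script = generate(feedback)
        feedback = execute_script(script, app)
        if feedback is None:
            return True, round_idx
    return False, max_rounds


class FakeApp:
    """Hypothetical stand-in for a CAD application's automation object."""

    def __init__(self) -> None:
        self.lines = []

    def add_line(self, start, end) -> None:
        self.lines.append((start, end))


def fake_generate(feedback: Optional[str]) -> str:
    """Hypothetical model: the first attempt uses a wrong method name."""
    if feedback is None:
        return "app.AddLine((0, 0), (1, 1))"
    return "app.add_line((0, 0), (1, 1))"


app = FakeApp()
succeeded, rounds_used = run_with_self_correction(fake_generate, app)
```

On Windows, the `app` object would typically come from a COM client library, for example pywin32's `win32com.client.Dispatch` called with the target application's ProgID. Running untrusted generated code with `exec` is only reasonable inside an isolated environment such as the containerized setups the summary describes.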

Topics

Component Object Model
AI Agents
CAD Automation
Program Synthesis
Reinforcement Learning
ComCADBench Benchmark

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.