ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
Summary
ComAct introduces a novel paradigm for professional software manipulation, reframing interaction as deterministic program synthesis via the Component Object Model (COM). This approach addresses critical limitations of existing GUI-based agents, which suffer from fragile visual grounding and error accumulation, and API-based methods, constrained by heterogeneous protocols and inaccessible commercial interfaces. To validate ComAct, the researchers developed ComCADBench, the first benchmark for agents operating real industrial CAD software, including SolidWorks, Inventor, and AutoCAD, across 1,000 tasks. They also created ComActor, a self-correcting agent trained through a progressive three-stage framework, and ComForge, a scalable platform utilizing Dockerized Windows environments for large-scale training. ComActor achieved superior performance on ComCADBench, demonstrating strong resilience in long-horizon tasks and outperforming frontier proprietary models like GPT-5 and Claude-Sonnet-4.6. It also generalized effectively to external CAD benchmarks such as Text2CAD and CADPrompt.
Key takeaway
AI engineers building agents for complex professional software such as CAD should prioritize programmatic interfaces over GUI-based approaches. By using COM for deterministic program synthesis, the ComAct paradigm is reported to be more reliable and more broadly applicable, especially on long-horizon tasks. Consider a multi-stage training framework that includes geometric reward optimization, so that agents reach both syntactic correctness and task-level fidelity while avoiding the pitfalls of fragile visual grounding.
Key insights
COM-as-Action reframes professional software manipulation as deterministic program synthesis, overcoming GUI fragility and API limitations.
Principles
- COM offers a unified, semantic programmatic interface.
- Code-driven execution prevents cascading errors in long tasks.
- Geometric reward optimization bridges the gap between syntactically valid code and geometrically correct output.
Method
ComActor is trained via a three-stage framework: instruction-to-code SFT, agentic refinement with multimodal feedback, and task-level GRPO with continuous geometric reward, all within ComForge's parallelized Windows environments.
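To make the third stage more concrete, here is a minimal sketch of how a continuous geometric reward could feed a GRPO-style update. The summary does not say which geometric metric ComActor uses, so the voxel intersection-over-union below is an assumption, and the names `voxel_iou` and `group_relative_advantages` are hypothetical. The only part taken from GRPO itself is the idea of standardizing each rollout's reward against the other rollouts sampled for the same task.

```python
from statistics import mean, pstdev


def voxel_iou(predicted: set, target: set) -> float:
    """Assumed geometric reward in [0, 1]: IoU of occupied voxel coordinates."""
    union = predicted | target
    if not union:
        return 1.0  # two empty shapes are treated as identical
    return len(predicted & target) / len(union)


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Standardize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Target geometry: a 2 x 2 x 1 plate, represented as voxel coordinates.
target = {(x, y, 0) for x in range(2) for y in range(2)}

# Three hypothetical rollouts for the same task.
rollouts = [
    set(target),             # program reproduced the plate exactly
    {(0, 0, 0), (1, 0, 0)},  # program built only half of the plate
    set(),                   # program failed to produce any geometry
]

rewards = [voxel_iou(shape, target) for shape in rollouts]
advantages = group_relative_advantages(rewards)
```

Because the reward is continuous, the half-correct rollout is distinguished from the outright failure instead of both receiving zero, which is presumably what lets task-level optimization close the gap between code that runs and code that produces the right shape.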
In practice
- Implement COM for robust industrial software automation.
- Employ multi-stage training for self-correcting agents.
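The two recommendations above can be illustrated together with a small execute-and-retry loop. This is a sketch under stated assumptions, not ComActor's implementation: `FakeCadApp` is a hypothetical stand-in for a COM automation object so the example runs anywhere, its `AddBox` method is invented for illustration and is not part of any real CAD API, and `toy_generator` replaces the model. On Windows, a real automation object would typically be obtained through pywin32's `win32com.client.Dispatch`.

```python
class FakeCadApp:
    """Hypothetical stand-in for a COM automation object; it records calls only."""

    def __init__(self):
        self.log = []

    def AddBox(self, width, height, depth):
        # Invented method: rejects invalid geometry the way a real API might.
        if min(width, height, depth) <= 0:
            raise ValueError("box dimensions must be positive")
        self.log.append(("box", width, height, depth))


def run_with_self_correction(generate, make_app, max_attempts=3):
    """Run generated code; on failure, pass the error text back to the generator."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        program = generate(feedback)
        app = make_app()  # fresh application state for every attempt
        try:
            exec(program, {"app": app})
            return app, attempt
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(
        f"no valid program after {max_attempts} attempts; last error: {feedback}"
    )


def toy_generator(feedback):
    """Stand-in for the model: emits a bad program first, then a corrected one."""
    if feedback is None:
        return "app.AddBox(10, -5, 2)"
    return "app.AddBox(10, 5, 2)"


app, attempts = run_with_self_correction(toy_generator, FakeCadApp)
```

Each attempt runs a complete program against fresh state, so an error surfaces as a single exception that can be reported back, rather than as a drifted GUI state that later steps silently build on.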
Topics
- Component Object Model
- AI Agents
- CAD Automation
- Program Synthesis
- Reinforcement Learning
- ComCADBench Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.