MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, long

Summary

MUSE is a Text-to-CAD benchmark that addresses a gap in evaluation: existing benchmarks do not assess complex, editable boundary representation (B-Rep) assemblies of the kind used in industrial product design. Developed by The Hong Kong Polytechnic University and Curvature Flow Co., Limited, MUSE pairs 106 practical design instances with structured Design Specifications. It introduces a three-stage evaluation protocol: a code check, a geometric check (watertight, free of self-intersections, manifold, and overlap-free), and a design-intent alignment check that assesses functionality, manufacturability, and assemblability against design-specific rubrics. A rubric-based vision-language model (VLM) judge, validated against human annotation, makes the evaluation scalable. Experiments with closed-source and open-source LLMs reveal a clear failure cascade across the three stages: even strong models achieve limited success on fine-grained engineering criteria, which highlights the need to advance Text-to-CAD beyond geometric generation.

Key takeaway

If you are an AI engineer developing Text-to-CAD models, prioritize engineering-ready design over geometric similarity. Your models should satisfy functionality, manufacturability, and assemblability criteria, not just visual resemblance. Use structured Design Specifications and a multi-stage evaluation protocol like MUSE's to locate where outputs fail along the cascade from executable code, to valid geometry, to a practical engineering design. Measuring each stage separately points development toward CAD output that is usable in industrial settings.

Key insights

Text-to-CAD for industrial design must be evaluated on functionality, manufacturability, and assemblability, not only on geometric similarity.

Principles

Method

MUSE evaluates Text-to-CAD output through a three-stage protocol: a code check (does the generated program execute), a geometric validity check (watertight, manifold, free of self-intersections, and overlap-free), and a design-intent alignment check scored by a rubric-based VLM judge.
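The staged protocol above can be sketched as a short pipeline that stops at the first failing stage, which is what makes a failure cascade measurable. This is a minimal illustration under stated assumptions, not MUSE's implementation: the geometry is a triangle mesh standing in for a B-Rep, the generated program is assumed to define a variable named `faces`, only the watertightness and edge-manifold checks are implemented (self-intersection and overlap tests are omitted), and `judge` is a hypothetical callable standing in for the rubric-based VLM judge. All function names and the `threshold` parameter are invented for this sketch.

```python
from collections import Counter
from typing import Callable

Triangle = tuple[int, int, int]  # vertex indices of one triangular face


def edge_counts(faces: list[Triangle]) -> Counter:
    """Count how many faces use each undirected edge."""
    counts: Counter = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    return counts


def is_edge_manifold(faces: list[Triangle]) -> bool:
    """No edge may be shared by more than two faces."""
    return all(n <= 2 for n in edge_counts(faces).values())


def is_watertight(faces: list[Triangle]) -> bool:
    """A closed surface has every edge shared by exactly two faces."""
    counts = edge_counts(faces)
    return bool(counts) and all(n == 2 for n in counts.values())


def evaluate(
    cad_code: str,
    judge: Callable[[list[Triangle]], float],
    threshold: float = 1.0,  # hypothetical pass mark for the rubric score
) -> tuple[str, str]:
    """Return (first failed stage or 'passed', detail message)."""
    # Stage 1: code check. The program must run and define `faces`
    # (an assumption of this sketch). Never exec untrusted code
    # outside a sandbox.
    namespace: dict = {}
    try:
        exec(cad_code, namespace)
        faces = [tuple(f) for f in namespace["faces"]]
    except Exception as exc:
        return "code", f"{type(exc).__name__}: {exc}"

    # Stage 2: geometric check (subset of the checks named in the text).
    if not is_edge_manifold(faces):
        return "geometry", "non-manifold edge"
    if not is_watertight(faces):
        return "geometry", "not watertight"

    # Stage 3: design-intent alignment, delegated to the judge stub,
    # which returns the fraction of rubric items satisfied.
    score = judge(faces)
    if score < threshold:
        return "design_intent", f"rubric score {score:.2f}"
    return "passed", f"rubric score {score:.2f}"


# A closed tetrahedron and an open single triangle as toy inputs.
TETRA = "faces = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]"
OPEN_TRIANGLE = "faces = [(0, 1, 2)]"

if __name__ == "__main__":
    for name, code in [("tetra", TETRA), ("open", OPEN_TRIANGLE)]:
        print(name, evaluate(code, judge=lambda faces: 0.5))
```

Returning the first failed stage, instead of a single pass/fail flag, lets results be aggregated per stage across a benchmark, so a model that fails mostly at execution can be told apart from one that produces valid geometry but misses the design intent.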

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.