Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning; 3D Content Generation & Editing) · Depth: Expert, extended

Summary

The Beyond Voxel 3D Editing (BVE) framework, developed by researchers from Peking University, HKUST(GZ), Microsoft AI, and Microsoft Research, targets two obstacles in text-guided 3D asset editing: scarce training data and the difficulty of keeping edits semantically consistent while leaving unedited regions intact. To address the data gap, the authors build Edit-3DVerse, a self-constructed dataset of over 100,000 high-quality samples used to train and evaluate 3D editing models. On the modeling side, BVE extends a foundational image-to-3D generative architecture with lightweight, trainable modules, KVComposer and the Tri-Attention Block, which inject text-driven semantics without extensive retraining. An annotation-free 3D masking strategy preserves local invariance in regions the edit should not touch. In the reported experiments, BVE produces high-quality, text-aligned 3D assets that retain the original visual characteristics, and it outperforms the baselines Vox-E, Tailor3D, and TRELLIS on both quantitative and qualitative measures.

Key takeaway

For research scientists developing 3D generative models, BVE offers a way around two current editing limitations: limited training data and poor preservation of unedited regions. Consider adopting its lightweight module integration and annotation-free 3D masking strategy to improve semantic consistency and identity preservation in your own text-guided 3D editing frameworks. The authors report gains in both local and global edits, and because the added modules are lightweight, the approach avoids the computational cost of retraining the full model.

Key insights

BVE enables high-fidelity, text-guided 3D editing by integrating lightweight modules and a 3D masking strategy into a generative architecture.

Method

BVE integrates KVComposer and Tri-Attention Blocks into a pre-trained TRELLIS framework, using a two-stage rectified flow model for structure and latent editing, and an automatic 3D mask loss for spatial consistency.
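
The summary does not spell out how the 3D mask loss is computed, so the following is only a minimal sketch of the general idea: penalise differences between the source and edited assets in voxels that should stay unchanged, and ignore differences inside the edit region. The function name `masked_preservation_loss`, the flattened voxel representation, and the mean-squared-error form are all assumptions made for illustration, not details taken from BVE.

```python
from typing import Sequence


def masked_preservation_loss(
    source: Sequence[float],
    edited: Sequence[float],
    edit_mask: Sequence[bool],
) -> float:
    """Mean squared error over voxels that the mask marks as unedited.

    Hypothetical helper for illustration; not taken from the BVE code.
    Inputs are flattened voxel grids of equal length, and edit_mask[i]
    is True where voxel i is allowed to change.
    """
    if not (len(source) == len(edited) == len(edit_mask)):
        raise ValueError("source, edited, and edit_mask must have equal length")

    squared_error = 0.0
    preserved = 0
    for s, e, editable in zip(source, edited, edit_mask):
        if editable:
            continue  # changes inside the edit region are not penalised
        squared_error += (s - e) ** 2
        preserved += 1

    # With no preserved voxels there is nothing to penalise.
    return squared_error / preserved if preserved else 0.0


# Toy 4-voxel example: voxel 2 is the edit region.
source = [0.0, 1.0, 2.0, 3.0]
edited = [0.0, 1.5, 9.0, 3.0]
edit_mask = [False, False, True, False]
loss = masked_preservation_loss(source, edited, edit_mask)
```

In this toy case the large change at voxel 2 contributes nothing because it lies inside the edit region, while the small drift at voxel 1 is penalised. A real implementation would operate on tensors and derive the mask automatically, which the summary describes only as annotation-free.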

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.