Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

Source: cs.CV updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Gaming & Interactive Media) · Depth: Expert, extended

Summary

Native3D is an end-to-end 3D scene generation framework developed by Kuaishou GameMind Lab. It generates scenes directly in 3D rather than through intermediate 2D representations, avoiding the domain adaptation problems those representations introduce, such as geometric distortion and texture degradation. The system uses a unified mesh-texture joint representation and a Transformer-based scene encoder to model geometric structure and texture features together, preserving spatial relationships and visual consistency. It also introduces the 3D Representation Alignment Loss (3D REPA Loss), an improved contrastive learning mechanism that aligns multi-level semantic representations to improve fidelity. Native3D supports several scene editing tasks: object addition, object removal, spatial rearrangement, and appearance style transfer. Trained on 9,001 room pairs from the 3D-FRONT dataset at a total computational cost of approximately 3,100 GPU hours, it leads existing methods in CLIP Score (CS) and on the structural editing tasks (Add, Remove, Move).
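The summary does not say how the paper computes CLIP Score. As a rough illustration only, the sketch below uses one common convention: the cosine similarity between an image embedding and a text embedding, clipped at zero and scaled by 100. The function names, the scale factor, and the assumption that embeddings have already been extracted by a CLIP model are all illustrative and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def clip_style_score(image_embedding, text_embedding, scale=100.0):
    """Hypothetical CLIP-style score: scaled cosine similarity, clipped at 0.

    Both embeddings are assumed to come from the same CLIP model;
    extracting them is outside the scope of this sketch.
    """
    similarity = cosine_similarity(image_embedding, text_embedding)
    return scale * max(similarity, 0.0)
```

Under this convention, a rendered view whose embedding points in the same direction as the prompt embedding scores 100, and unrelated or opposed embeddings score 0.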

Key takeaway

For AI engineers building 3D scene generation or editing systems, Native3D makes the case for modeling scenes natively in 3D rather than converting from 2D, which avoids the geometric distortion and texture degradation that 2D intermediates introduce. Its unified mesh-texture representation and 3D REPA Loss are credited with better multi-view consistency and more precise object manipulation. Adopting a direct 3D framework of this kind can improve the fidelity and flexibility of virtual content creation workflows.

Key insights

Direct 3D scene generation via unified mesh-texture modeling and semantic alignment overcomes 2D representation limitations.

Principles

Method

Native3D employs a Transformer-based scene encoder for joint mesh-texture modeling, a Diffusion Transformer (DiT) for generation, and 3D REPA Loss for multi-level semantic feature alignment.
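The summary describes 3D REPA Loss only as an improved contrastive mechanism that aligns multi-level semantic representations, so its exact form is not known here. The sketch below is not the paper's loss. It shows the generic idea with a standard InfoNCE term computed at each feature level and combined with per-level weights. The function names, the temperature, the weights, and the choice of feature levels are all hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def info_nce(model_feats, target_feats, temperature=0.1):
    """Standard InfoNCE: row i of model_feats should match row i of target_feats.

    Every other row in target_feats acts as a negative for row i.
    """
    queries = [normalize(v) for v in model_feats]
    keys = [normalize(v) for v in target_feats]
    total = 0.0
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, key)) / temperature for key in keys
        ]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def multi_level_alignment_loss(model_levels, target_levels, weights=None,
                               temperature=0.1):
    """Hypothetical multi-level alignment: weighted sum of per-level InfoNCE terms.

    model_levels / target_levels: one batch of feature vectors per level,
    e.g. generator hidden states versus features from a frozen reference encoder.
    """
    if weights is None:
        weights = [1.0] * len(model_levels)
    return sum(
        w * info_nce(m, t, temperature)
        for w, m, t in zip(weights, model_levels, target_levels)
    )
```

When each model feature points in the same direction as its paired target feature, the loss approaches zero. When the pairing is scrambled, the loss grows, which is the signal that pulls the generator's features toward the reference representation at every level.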

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.