Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Expert, extended

Summary

Native3D is an end-to-end 3D scene generation framework from Kuaishou GameMind Lab that works directly in 3D, avoiding the geometric distortion and texture degradation that arise when 2D intermediate representations are adapted to 3D. It uses a unified mesh-texture joint representation and a Transformer-based scene encoder to model geometric structure and texture features together, which preserves spatial relationships and visual consistency. It also introduces the 3D Representation Alignment Loss (3D REPA Loss), an improved contrastive learning mechanism that aligns multi-level semantic representations to improve fidelity. Native3D supports several scene editing tasks: object addition, removal, spatial rearrangement, and appearance style transfer. Trained on 9,001 room pairs from the 3D-FRONT dataset at a total computational cost of approximately 3,100 GPU hours, it leads existing methods in CLIP Score (CS) and on the structural editing tasks (Add, Remove, Move).

Key takeaway

For AI engineers building 3D scene generation or editing systems, Native3D argues for an architectural shift: model scenes natively in 3D rather than converting from 2D, which avoids the geometric distortion and texture degradation of 2D-to-3D pipelines. Its unified mesh-texture representation and 3D REPA Loss target multi-view consistency and precise object manipulation. Adopting a direct 3D framework of this kind can improve the fidelity and flexibility of virtual content creation workflows.

Key insights

Direct 3D scene generation via unified mesh-texture modeling and semantic alignment overcomes 2D representation limitations.

Principles

Unified mesh-texture modeling preserves spatial consistency.
Multi-level 3D semantic alignment improves fidelity.
Bypassing 2D intermediates prevents geometric distortion.

Method

Native3D combines three components: a Transformer-based scene encoder for joint mesh-texture modeling, a Diffusion Transformer (DiT) for generation, and the 3D REPA Loss for multi-level semantic feature alignment.
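
The summary describes the 3D REPA Loss only as an improved contrastive mechanism that aligns representations at several levels; it does not give the formula. As a rough illustration of that general idea, and not the paper's actual loss, the sketch below uses a standard InfoNCE contrastive term per feature level and sums the levels with weights. The function names, the temperature value, the per-level weights, and the pairing of "scene" features with "target" semantic features are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(scene_feats, target_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    scene_feats[i] and target_feats[i] form the positive pair; every other
    target in the batch serves as a negative. The temperature is an assumed value.
    """
    losses = []
    for i, scene_vec in enumerate(scene_feats):
        logits = [cosine(scene_vec, t) / temperature for t in target_feats]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def multi_level_alignment_loss(levels, weights=None):
    """Weighted sum of InfoNCE terms, one per feature level.

    levels: list of (scene_feats, target_feats) pairs, e.g. one pair per
    encoder depth. The weighting scheme is hypothetical.
    """
    weights = weights or [1.0] * len(levels)
    return sum(w * info_nce(s, t) for w, (s, t) in zip(weights, levels))


# Toy batch: two samples with 2-dimensional features.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned, aligned)      # positives match: small loss
loss_shuffled = info_nce(aligned, shuffled)    # positives mismatched: large loss
loss_two_levels = multi_level_alignment_loss(
    [(aligned, aligned), (aligned, aligned)], weights=[1.0, 0.5]
)
```

In this toy setup the loss is small when each scene feature points in the same direction as its paired target and large when the pairs are swapped, which is the behaviour an alignment objective of this family is meant to reward.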

In practice

Generate photorealistic 3D indoor scenes from text.
Edit 3D scenes: add, remove, move objects.
Apply appearance style transfer to 3D environments.
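
To make the editing tasks concrete, the sketch below expresses Add, Remove, Move, and a style change as operations on a toy scene description. It only illustrates what each edit means for the scene state, for example when writing down the expected result of an edit request; it is not how Native3D performs edits, which the summary attributes to its generative model. The `Scene` and `SceneObject` classes, their fields, and the use of metres for positions are hypothetical.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical object record: a name, a position, and a style label."""
    name: str
    position: tuple  # (x, y, z); metres are an assumption of this sketch
    style: str = "default"


@dataclass
class Scene:
    """Hypothetical scene container keyed by object name."""
    objects: dict = field(default_factory=dict)

    def add(self, obj):
        """Add: insert a new object; refuse duplicates by name."""
        if obj.name in self.objects:
            raise ValueError(f"object {obj.name!r} already exists")
        self.objects[obj.name] = obj

    def remove(self, name):
        """Remove: delete an object by name (KeyError if it is absent)."""
        del self.objects[name]

    def move(self, name, new_position):
        """Move: same object, new position."""
        self.objects[name] = replace(self.objects[name], position=new_position)

    def restyle(self, style):
        """Style transfer: geometry and layout stay fixed, appearance label changes."""
        self.objects = {
            name: replace(obj, style=style) for name, obj in self.objects.items()
        }


room = Scene()
room.add(SceneObject("sofa", (0.0, 0.0, 0.0)))
room.add(SceneObject("lamp", (1.5, 0.0, 0.5)))
room.move("sofa", (2.0, 0.0, 1.0))
room.remove("lamp")
room.restyle("scandinavian")
```

Keeping the structural edits (add, remove, move) separate from the appearance edit mirrors the split in the summary between the structural editing tasks and appearance style transfer.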

Topics

3D Scene Generation
Mesh-Texture Modeling
Diffusion Transformers
3D REPA Loss
Semantic Alignment
Indoor Scene Editing

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.