SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning) · Depth: Expert, quick

Summary

SuperVoxelGPT is a framework for high-resolution 3D shape generation with autoregressive multimodal large language models (MLLMs). Existing 3D tokenization methods either lose spatial ordering (set-based representations) or produce excessively long, redundant sequences (uniform or octree-based voxel grids). SuperVoxelGPT addresses both problems with adaptive, deterministically ordered supervoxel tokenization. It first predicts a coarse geometric saliency distribution, then builds a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, which allocates fine-grained cells to complex regions and larger cells to smooth areas. A SuperVoxelVAE then produces the supervoxel tokens, on which a pretrained MLLM is fine-tuned for autoregressive generation. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of that of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10x speedup over prior methods.

Key takeaway

For AI scientists and machine learning engineers developing 3D generation models: if long token sequences or ambiguous spatial ordering limit your MLLM-based 3D generation, consider adaptive supervoxel tokenization. In the reported experiments it cuts token sequence length by 87.2% relative to uniform voxel tokenization and speeds up generation by an average of 10x over prior methods, while reaching state-of-the-art quality on complex 3D assets.
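
The two headline figures describe the same result: a sequence that is 12.8% as long as the uniform-voxel baseline is 87.2% shorter. The short Python sketch below makes that arithmetic explicit. The baseline length of 10,000 tokens is a hypothetical number chosen for illustration, and the attention-cost line assumes standard full self-attention, whose pairwise cost grows with the square of sequence length; neither comes from the paper.

```python
def token_savings(uniform_tokens: int, ratio: float = 0.128) -> dict:
    """Relate the reported length ratio (12.8%) to the quoted reduction (87.2%)."""
    adaptive_tokens = round(uniform_tokens * ratio)
    return {
        "adaptive_tokens": adaptive_tokens,
        "reduction_pct": round((1.0 - ratio) * 100, 1),
        # Assumption: full self-attention, so pairwise cost scales with length squared.
        "attention_cost_ratio": round(ratio ** 2, 4),
    }


# 10_000 is a hypothetical baseline sequence length, not a figure from the paper.
stats = token_savings(10_000)
print(stats)
```

Under that attention assumption, the pairwise attention cost would fall to under 2% of the baseline, which is one plausible contributor to the reported speedup; the source does not break the speedup down.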

Key insights

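SuperVoxelGPT enables efficient, high-resolution 3D shape generation by adaptively tokenizing shapes into deterministically ordered supervoxels, avoiding both the lost spatial ordering of set-based tokens and the long, redundant sequences of uniform or octree-based voxel grids.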

Method

Predict a coarse geometric saliency distribution, construct a shape-adaptive supervoxel partition via saliency-guided centroidal Voronoi tessellation, then use a SuperVoxelVAE to produce supervoxel tokens that a fine-tuned MLLM generates autoregressively.
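
To make the partition step concrete, here is a minimal Python sketch of a saliency-weighted centroidal Voronoi tessellation on a small voxel grid, using generic weighted Lloyd iterations. It is an illustration under stated assumptions, not the authors' implementation: the function name `saliency_cvt`, the synthetic saliency field, the 8x8x8 grid, the cell count, and the z-then-y-then-x ordering rule are all hypothetical. The summary says only that the ordering is deterministic, not how it is defined.

```python
import numpy as np


def saliency_cvt(points, saliency, n_cells, n_iters=20, seed=0):
    """Discrete saliency-weighted CVT via Lloyd iterations (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    weights = saliency / saliency.sum()
    # Seed sites preferentially in salient regions so they start with more cells.
    sites = points[rng.choice(len(points), size=n_cells, replace=False, p=weights)]

    def assign(current_sites):
        # Voronoi step: label each point with the index of its nearest site.
        d2 = ((points[:, None, :] - current_sites[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(sites)
        # Centroid step: move each site to the saliency-weighted mean of its cell.
        for k in range(n_cells):
            mask = labels == k
            if mask.any():
                sites[k] = np.average(points[mask], axis=0, weights=saliency[mask])

    labels = assign(sites)
    # Assumed ordering rule: sort cells by centroid z, then y, then x.
    # np.lexsort treats the LAST key as the primary sort key.
    order = np.lexsort((sites[:, 0], sites[:, 1], sites[:, 2]))
    rank = np.empty(n_cells, dtype=int)
    rank[order] = np.arange(n_cells)
    return sites[order], rank[labels]


# Hypothetical 8x8x8 grid of voxel centres with saliency peaked at one corner.
axis = np.arange(8, dtype=float)
points = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
saliency = 1.0 + 9.0 * np.exp(-(points ** 2).sum(axis=1) / 8.0)

sites, labels = saliency_cvt(points, saliency, n_cells=16)
print(sites.shape, labels.shape)
```

Each voxel ends up with a cell index that doubles as its position in the token sequence, so the same shape and seed always yield the same order. A real system would replace the brute-force distance matrix with a spatial index and take the saliency from a learned predictor rather than a synthetic field.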

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.