GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos
Summary
GenLCA is a diffusion-based generative model for creating and editing photorealistic full-body avatars from text and image inputs. Developed by Timur Bagautdinov, Lei Xiao, Zhaoen Su, Shunsuke Saito, Xiaogang Jin, and four other authors, the model generates avatars that stay faithful to the input while supporting high-fidelity facial and full-body animation. Its key innovation is training a full-body 3D diffusion model on millions of real-world 2D video frames in which the subject is only partially observable. It does this by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer that converts unstructured video frames into structured 3D tokens. Because partial observations would otherwise cause blurring and transparency artifacts, GenLCA uses a visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. This makes large-scale real-world video data usable for native 3D diffusion training, which the authors report yields better photorealism and generalizability than existing solutions.
Key takeaway
For research scientists developing 3D generative models, GenLCA shows that a 3D diffusion model can be trained effectively on large-scale, partially observed 2D video data. Consider adopting visibility-aware training and repurposing a pretrained reconstruction model as a 3D tokenizer to overcome 3D data scarcity and improve photorealism in your own projects.
Key insights
GenLCA learns to generate photorealistic 3D avatars from partially observed 2D video data by using a visibility-aware diffusion training strategy.
Principles
- Scale 3D diffusion training with partial 2D data.
- Repurpose pretrained models as 3D tokenizers.
Method
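GenLCA uses a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer that encodes video frames into structured 3D tokens. A flow-based diffusion model is then trained on these tokens with a visibility-aware strategy: invalid (unobserved) regions are replaced with learnable tokens, and the loss is computed only over valid regions to prevent blurring and transparency artifacts.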
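The visibility-aware loss described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the linear noising path with a `noise - x_0` velocity target, the per-token mean squared error, and the `predict_velocity` callable are all assumptions made for the example, and `mask_token` stands in for a token that would be learned during real training.

```python
import random


def visibility_aware_flow_loss(tokens, visible, mask_token, predict_velocity, rng):
    """Masked flow-matching loss for one sample (illustrative sketch only).

    tokens:           list of N token vectors (each a list of D floats)
    visible:          list of N bools, True where the token was observed
    mask_token:       list of D floats standing in for a learnable token
    predict_velocity: hypothetical model, called as f(model_input, t)
    rng:              random.Random instance
    """
    t = rng.random()  # diffusion time in [0, 1)
    model_input, targets = [], []
    for token, is_visible in zip(tokens, visible):
        noise = [rng.gauss(0.0, 1.0) for _ in token]
        if is_visible:
            # Assumed linear path: x_t = (1 - t) * x_0 + t * noise
            model_input.append([(1.0 - t) * x + t * n for x, n in zip(token, noise)])
            # Matching velocity target for that path: noise - x_0
            targets.append([n - x for x, n in zip(token, noise)])
        else:
            # Invalid region: the model sees the mask token, never the data
            model_input.append(list(mask_token))
            targets.append(None)  # no supervision for this token

    predictions = predict_velocity(model_input, t)

    # Average the squared error over valid tokens only
    total, count = 0.0, 0
    for pred, target in zip(predictions, targets):
        if target is None:
            continue
        total += sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)
        count += 1
    return total / max(count, 1)
```

Because invalid tokens are swapped out before the model sees them and skipped when the error is averaged, whatever values the tokenizer produced for unobserved regions cannot influence the loss in this sketch.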
In practice
- Generate full-body avatars from text/images.
- Edit existing photorealistic avatars.
- Animate avatars with high fidelity.
Topics
- GenLCA
- 3D Diffusion Models
- Full-Body Avatars
- In-the-Wild Videos
- Animatable 3D Tokenizer
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.