GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

2026-04-08 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Expert, quick

Summary

GenLCA is a diffusion-based generative model that creates and edits photorealistic full-body avatars from text and image inputs. Developed by Timur Bagautdinov, Lei Xiao, Zhaoen Su, Shunsuke Saito, Xiaogang Jin, and four other authors, it generates avatars that stay faithful to the input while supporting high-fidelity facial and full-body animation.

Its key innovation is training a full-body 3D diffusion model on millions of real-world 2D video frames in which the subject is only partially observable. To do so, it repurposes a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer that converts unstructured video frames into structured 3D tokens. Because partial observations would otherwise cause blurring and transparency artifacts, GenLCA uses a visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. This makes large-scale real-world video usable for native 3D diffusion training, which the authors report yields better photorealism and generalizability than existing solutions.

Key takeaway

For research scientists developing 3D generative models, GenLCA shows that 3D diffusion models can be trained effectively on large-scale, partially observed 2D video data. Consider adopting visibility-aware training and repurposing pretrained reconstruction models as 3D tokenizers to overcome 3D data scarcity and improve photorealism in your own projects.

Key insights

GenLCA learns to generate photorealistic 3D avatars from partially observed 2D video data by using a visibility-aware diffusion training strategy.

Principles

Scale 3D diffusion training with partial 2D data.
Repurpose pretrained models as 3D tokenizers.

Method

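GenLCA uses a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer that encodes video frames into structured 3D tokens. A flow-based diffusion model is then trained on these tokens with a visibility-aware strategy: invalid regions are replaced with learnable tokens and the loss is computed only over valid regions, which prevents blurring and transparency artifacts.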
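
The visibility-aware idea can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names (`flow_matching_pair`, `apply_visibility`, `masked_mse`), the linear interpolation between clean tokens and noise, the point at which invalid tokens are swapped out, and the fixed `mask_token` values are all assumptions made for illustration; in a real model the placeholder token would be a trainable parameter and the prediction would come from the diffusion network.

```python
def flow_matching_pair(clean, noise, t):
    """Assumed flow-matching setup: x_t = (1 - t) * clean + t * noise,
    with velocity target (noise - clean). Not confirmed by the source."""
    x_t = [[(1 - t) * c + t * n for c, n in zip(c_row, n_row)]
           for c_row, n_row in zip(clean, noise)]
    velocity = [[n - c for c, n in zip(c_row, n_row)]
                for c_row, n_row in zip(clean, noise)]
    return x_t, velocity


def apply_visibility(tokens, visible, mask_token):
    """Replace every token flagged as not visible with a shared placeholder."""
    return [list(tok) if vis else list(mask_token)
            for tok, vis in zip(tokens, visible)]


def masked_mse(pred, target, visible):
    """Mean squared error averaged over visible tokens only."""
    total, count = 0.0, 0
    for p_row, t_row, vis in zip(pred, target, visible):
        if not vis:
            continue  # invalid regions contribute nothing to the loss
        total += sum((p - t) ** 2 for p, t in zip(p_row, t_row))
        count += len(p_row)
    return total / count if count else 0.0


# Toy data: three 2-dimensional tokens, the middle one unobserved.
tokens = [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
visible = [True, False, True]
mask_token = [0.5, -0.5]  # hypothetical; trainable in a real model
noise = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]

x_t, target = flow_matching_pair(tokens, noise, t=0.5)
model_input = apply_visibility(x_t, visible, mask_token)
pred = [[0.0, 0.0] for _ in tokens]  # stand-in for the network output
loss = masked_mse(pred, target, visible)
```

Because the loss skips unobserved tokens, whatever the tokenizer produced for those regions cannot pull the model toward blurry or transparent outputs, while the placeholder still tells the network which inputs to disregard.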

In practice

Generate full-body avatars from text/images.
Edit existing photorealistic avatars.
Animate avatars with high fidelity.

Topics

GenLCA
3D Diffusion Models
Full-Body Avatars
In-the-Wild Videos
Animatable 3D Tokenizer

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.