Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

Flex4DHuman is a multi-view video diffusion model that transforms monocular or sparse multi-view videos of dynamic subjects into synchronized dense multi-view videos. Unlike prior human-centric methods, it requires no explicit geometry priors; generation is conditioned only on relative camera poses, supplied through positional encoding. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman encodes camera and view information via a five-axis positional encoding that extends spatio-temporal RoPE. A three-stage curriculum trains the model for pose following, flexible reference-to-target view generation, and temporal rollout, supported by clean historical target-view tokens and multi-view captions for text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, it lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show Flex4DHuman surpassing the prior state of the art, and after mixed human-animal training it also generalizes to animal categories, which makes it practical for scalable 4D content creation in gaming and AR/VR.

Key takeaway

For computer vision engineers and 3D content creators who need dynamic 4D assets from casual video, Flex4DHuman removes a major pipeline dependency: monocular or sparse multi-view footage can be turned into 4D Gaussian splats without explicit geometry priors. This simplifies content creation for AR/VR, gaming, and simulation, and makes it easier to prototype dynamic human and animal models. Consider evaluating this diffusion-based approach for your 4D asset pipeline.

Key insights

Flex4DHuman uses multi-view video diffusion with relative camera-pose conditioning to reconstruct 4D humans and animals without explicit geometry priors.
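
To make "relative camera-pose conditioning" concrete, here is a minimal NumPy sketch of how a relative pose between a reference view and a target view can be computed. It assumes 4x4 camera-to-world matrices; the helper names `make_c2w` and `relative_pose` are hypothetical and are not taken from the paper, which may use a different pose parameterization.

```python
import numpy as np


def make_c2w(yaw_deg, position):
    """Hypothetical helper: build a 4x4 camera-to-world matrix from a yaw angle and a position."""
    yaw = np.deg2rad(yaw_deg)
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]  # rotation about the y axis
    pose[:3, 3] = position
    return pose


def relative_pose(ref_c2w, tgt_c2w):
    """Transform that maps target-camera coordinates into the reference camera's frame."""
    return np.linalg.inv(ref_c2w) @ tgt_c2w


# Assumed example rig: one reference camera and one target camera.
ref = make_c2w(0.0, [0.0, 0.0, 3.0])
tgt = make_c2w(45.0, [2.0, 0.0, 2.0])
rel = relative_pose(ref, tgt)
```

A relative pose does not change when the whole camera rig is moved or rotated in world space, so a model conditioned this way does not depend on a particular world coordinate frame.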

Method

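Flex4DHuman extends Wan 2.1 1.3B with a five-axis positional encoding for camera and view data, and trains through a three-stage curriculum covering pose following, flexible reference-to-target view generation, and temporal rollout. An off-the-shelf 4D Gaussian Splatting stage then lifts the generated multi-view videos into dynamic 4D Gaussian splats.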
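
As an illustration of what a multi-axis rotary positional encoding looks like, the sketch below applies RoPE over five position axes by giving each axis its own slice of the channel dimension. This is an assumption-laden sketch, not the paper's implementation: the summary does not say what the two extra axes are or how channels are divided, so the axis order (frame, height, width, plus two hypothetical camera/view axes) and the split `(8, 8, 8, 4, 4)` are chosen purely for demonstration.

```python
import numpy as np


def axis_angles(pos, dim, base=10000.0):
    """Rotary angles for one axis: positions (N,) times dim/2 frequencies -> (N, dim/2)."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(pos, freqs)


def multi_axis_rope(x, positions, axis_dims):
    """Rotate channel pairs of x (N, D) by angles derived from positions (N, A).

    axis_dims gives the number of channels assigned to each of the A axes;
    each entry must be even and the entries must sum to D.
    """
    assert x.shape[1] == sum(axis_dims)
    assert all(d % 2 == 0 for d in axis_dims)
    angles = np.concatenate(
        [axis_angles(positions[:, a], d) for a, d in enumerate(axis_dims)], axis=1
    )  # (N, D/2): one angle per channel pair
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


# Hypothetical layout: (frame, height, width, extra axis 1, extra axis 2).
axis_dims = (8, 8, 8, 4, 4)
tokens = np.random.default_rng(0).normal(size=(3, sum(axis_dims)))
positions = np.array(
    [[0, 0, 0, 0, 0], [1, 2, 3, 0, 1], [1, 2, 3, 1, 0]], dtype=float
)
encoded = multi_axis_rope(tokens, positions, axis_dims)
```

Because every channel pair is rotated by an angle proportional to its position, the dot product between two encoded tokens depends only on the difference of their positions along each axis. That property is what makes rotary encodings a natural fit for relative conditioning.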

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.