Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

2026-06-11 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

Flex4DHuman is a multi-view video diffusion model that turns monocular or sparse multi-view videos of dynamic subjects into synchronized dense multi-view videos. Unlike prior human-centric methods, it requires no explicit geometry priors: generation is conditioned only on relative camera poses, injected through positional encoding. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman encodes camera and view information with a five-axis positional encoding that extends spatio-temporal RoPE. A three-stage curriculum trains the model for pose following, flexible reference-to-target view generation, and temporal rollout, supported by clean historical target-view tokens and multi-view captions for text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, the model lifts monocular static-camera videos into dynamic 4D Gaussian splats. In experiments on DNA-Rendering and ActorsHQ, Flex4DHuman surpasses the prior state of the art, and after mixed human-animal training it generalizes to animal categories, which makes it a practical option for scalable 4D content creation in gaming and AR/VR.

Key takeaway

For computer vision engineers and 3D content creators who want dynamic 4D assets from casual video, Flex4DHuman shortens the pipeline. Monocular or sparse multi-view footage goes through the diffusion model and an off-the-shelf 4D Gaussian Splatting stage to produce dynamic 4D Gaussian splats, with no explicit geometry priors required. This simplifies content creation for AR/VR, gaming, and simulation, and supports rapid prototyping of dynamic human and animal models. Consider evaluating this diffusion-based approach for your own 4D asset pipeline.

Key insights

Flex4DHuman uses multi-view video diffusion with relative camera-pose conditioning to reconstruct 4D humans and animals without explicit geometry priors.

Principles

Relative camera-pose conditioning enables 4D reconstruction without explicit geometry priors.
Spatio-temporal RoPE can be extended with view indices and SE(3) geometry.
Progressive curriculum training improves complex video generation.
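
To make the first principle concrete, here is a minimal sketch of what a relative camera pose is and why it is a convenient conditioning signal. It assumes cameras are given as 4x4 camera-to-world rigid transforms, which is a common convention but not something this summary confirms for Flex4DHuman. The helper names (yaw_pose, relative_pose, and so on) are hypothetical and written for illustration only.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(T):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def relative_pose(ref_c2w, tgt_c2w):
    """Pose of the target camera expressed in the reference camera's frame."""
    return mat_mul(invert_rigid(ref_c2w), tgt_c2w)

def yaw_pose(angle_deg, tx, ty, tz):
    """Hypothetical helper: rotation about the y-axis plus a translation."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s, tx],
            [0.0, 1.0, 0.0, ty],
            [-s, 0.0, c, tz],
            [0.0, 0.0, 0.0, 1.0]]

def max_abs_diff(a, b):
    """Largest element-wise difference between two 4x4 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))

# Assumed example cameras: one reference view and one target view.
ref = yaw_pose(0.0, 0.0, 1.5, 3.0)
tgt = yaw_pose(45.0, 2.0, 1.5, 2.0)
rel = relative_pose(ref, tgt)

# Moving the whole capture rig by a global transform leaves the relative
# pose unchanged, so no shared world coordinate frame is needed.
world_shift = yaw_pose(30.0, 5.0, 0.0, -1.0)
rel_shifted = relative_pose(mat_mul(world_shift, ref), mat_mul(world_shift, tgt))
print(max_abs_diff(rel, rel_shifted))  # difference is at floating-point noise level
```

Because the relative pose depends only on how the two cameras sit with respect to each other, a model conditioned on it does not need the input video to be registered to any particular world frame.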

Method

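Flex4DHuman extends Wan 2.1 1.3B with a five-axis positional encoding that carries camera and view information, and trains through a three-stage curriculum covering pose following, flexible reference-to-target view generation, and temporal rollout. An off-the-shelf 4D Gaussian Splatting stage then lifts the generated multi-view videos into dynamic 4D Gaussian splats.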
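
The multi-axis positional encoding mentioned above can be illustrated with a generic sketch of rotary position embedding (RoPE) applied over several axes. This is not the paper's implementation: the axis names (time, height, width, view, camera), the even channel split, and the scalar stand-in for the camera axis are all assumptions, and this summary does not describe how Flex4DHuman actually encodes SE(3) geometry. The sketch only shows the general mechanism: each axis rotates its own slice of channels, so attention scores depend on coordinate differences along every axis.

```python
import math

def axis_rope(vec, position, base=10000.0):
    """Rotate consecutive channel pairs of vec by angles position * freq_i."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        freq = base ** (-i / half)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def multi_axis_rope(vec, coords, dims):
    """Apply RoPE per axis: axis a rotates its own slice of dims[a] channels."""
    assert len(coords) == len(dims) and sum(dims) == len(vec)
    assert all(d % 2 == 0 for d in dims)
    out, start = [], 0
    for pos, d in zip(coords, dims):
        out += axis_rope(vec[start:start + d], pos)
        start += d
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumed layout: 20 channels split evenly over five hypothetical axes
# (time, height, width, view, camera).
dims = [4, 4, 4, 4, 4]
q = [0.1 * i for i in range(1, 21)]
k = [0.05 * (21 - i) for i in range(1, 21)]

q_coords = (3, 5, 7, 1, 0.5)
k_coords = (1, 2, 4, 0, 0.25)
score = dot(multi_axis_rope(q, q_coords, dims), multi_axis_rope(k, k_coords, dims))

# Shifting both tokens by the same offset on every axis keeps the score,
# because RoPE makes attention depend on coordinate differences only.
offset = (10, 10, 10, 2, 1.0)
q_shift = tuple(a + b for a, b in zip(q_coords, offset))
k_shift = tuple(a + b for a, b in zip(k_coords, offset))
score_shifted = dot(multi_axis_rope(q, q_shift, dims), multi_axis_rope(k, k_shift, dims))
print(abs(score - score_shifted))  # difference is at floating-point noise level
```

The relative-only property is what makes a rotary scheme a natural carrier for relative camera information: two tokens interact according to how far apart they are in time, image position, and view, not according to absolute indices.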

In practice

Create dynamic 4D Gaussian splats from monocular videos.
Generate 4D content for simulation, gaming, and AR/VR.
Reconstruct both human and animal subjects.

Topics

4D Human Reconstruction
Multi-view Video Diffusion
Gaussian Splatting
Relative Camera Pose
Dynamic Content Creation
AR/VR Applications

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.