Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Graphics, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

The "Pixel Cube" project introduces a diffusion-based method for photorealistic, temporally consistent relighting of dynamic portrait videos. Training uses a hybrid dataset that combines real-captured and rendered dynamic portrait videos covering diverse subjects and motions, all with known lighting conditions. To capture the real footage, the authors built a custom 2 m x 2 m x 2 m LED-based "Pixel Cube" lighting system with 90 high-resolution LED panels at 1,900 nits peak luminance, which reproduces realistic lighting and supports high-speed data acquisition at 60 FPS. They fine-tune a pre-trained video diffusion model (Stable Video Diffusion), using per-frame HDR environment maps and synthesized background images as lighting controls. The resulting model is reported to achieve state-of-the-art identity preservation, lighting harmony, and temporal consistency, and to generalize well to unseen data.

Key takeaway

For computer vision engineers building video processing tools, this research offers a complete recipe for high-fidelity portrait video relighting, from data capture to inference. Consider a hybrid data acquisition strategy, potentially using an LED stage like the Pixel Cube, to generate diverse datasets paired with ground-truth lighting. A delight-then-relight diffusion pipeline with explicit lighting controls can improve photorealism and temporal consistency in applications ranging from consumer photography enhancement to unifying lighting across cinematic shots.

Key insights

Diffusion models, trained on hybrid real/synthetic data from a specialized LED stage, enable photorealistic, temporally consistent portrait video relighting.

Method

The method has two stages. A delight model first estimates a flat-lit albedo video from the input video. A relight model then takes that albedo, an HDR environment map that is encoded and injected via cross-attention, and a background image, and generates the relit video. Long videos are processed with an overlapping inference scheme.
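The overlapping inference scheme for long videos can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not say how overlapping frames are merged, so the triangular cross-fade here is one plausible choice, and the names `window_starts`, `relight_long_video`, and `relight_window`, along with the window length of 14 and overlap of 4, are hypothetical. `relight_window` stands in for one delight-then-relight pass over a short clip.

```python
import numpy as np


def window_starts(num_frames, window, overlap):
    """Start indices of fixed-length windows that together cover every frame."""
    if num_frames <= window:
        return [0]
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # last window ends exactly on the final frame
    return starts


def relight_long_video(frames, relight_window, window=14, overlap=4):
    """Run a per-clip model on overlapping windows and cross-fade the overlaps.

    frames: array of shape (T, H, W, C).
    relight_window: hypothetical callable mapping a clip (t, H, W, C) to a
        relit clip of the same shape (a stand-in for the diffusion model).
    """
    num_frames = frames.shape[0]
    window = min(window, num_frames)
    out = np.zeros(frames.shape, dtype=np.float64)
    weight = np.zeros(num_frames, dtype=np.float64)

    # Assumed blending rule: triangular weights, so frames near the centre
    # of a window contribute more than frames near its edges.
    ramp = np.minimum(np.arange(1, window + 1), np.arange(window, 0, -1)).astype(np.float64)
    bcast = (-1,) + (1,) * (frames.ndim - 1)  # reshape weights to broadcast over H, W, C

    for start in window_starts(num_frames, window, overlap):
        clip = relight_window(frames[start:start + window])
        out[start:start + window] += ramp.reshape(bcast) * clip
        weight[start:start + window] += ramp

    # Normalise by the accumulated weight of every window touching each frame.
    return out / weight.reshape(bcast)


if __name__ == "__main__":
    video = np.random.default_rng(0).random((30, 8, 8, 3))
    relit = relight_long_video(video, lambda clip: clip)  # identity stand-in model
    print(window_starts(30, 14, 4), relit.shape)
```

With an identity stand-in for the model, the blended output equals the input, which is a quick check that the weights are normalised correctly. In a real diffusion pipeline, blending decoded frames is only one option; the overlap could instead be handled in latent space or by conditioning each window on the previous window's output.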

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.