Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction
Summary
The "Pixel Cube" project introduces a diffusion-based method for photorealistic, temporally consistent relighting of dynamic portrait videos. The approach relies on a hybrid training dataset that combines real-captured and rendered dynamic portrait videos covering diverse subjects, motions, and known lighting conditions. To capture the real data, the authors built a custom 2 m x 2 m x 2 m LED-based "Pixel Cube" lighting system with 90 high-resolution LED panels (1,900 nits peak luminance) for realistic lighting emulation and high-speed 60 FPS data acquisition. The relighting model is a pre-trained video diffusion model (Stable Video Diffusion) fine-tuned with per-frame HDR environment maps and synthesized background images as lighting controls. It is reported to achieve state-of-the-art identity preservation, lighting harmony, and temporal consistency, and to generalize well to unseen data.
Key takeaway
For computer vision engineers developing video processing tools, this research offers a robust framework for high-fidelity portrait video relighting. You should consider adopting a hybrid data acquisition strategy, potentially utilizing LED stages like the Pixel Cube, to generate diverse, ground-truth-paired datasets. Implementing a delight-then-relight diffusion pipeline with explicit lighting controls can significantly improve photorealism and temporal consistency in your applications, from enhancing consumer photography to unifying cinematic lighting.
Key insights
Diffusion models, trained on hybrid real/synthetic data from a specialized LED stage, enable photorealistic, temporally consistent portrait video relighting.
Principles
- High-fidelity hybrid datasets are crucial for diffusion model fine-tuning.
- Decomposing relighting into delight and relight steps simplifies the problem.
- Explicit environment map and background control enhance lighting consistency.
Method
The method runs in two stages. A delight model first estimates a flat-lit albedo video from the input video. A relight model then generates the relit video from three inputs: that albedo, an HDR environment map that is encoded and injected via cross-attention, and a background image. Long videos are handled with an overlapping inference scheme, in which consecutive clips share frames.
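The overlapping inference scheme can be illustrated with a minimal sketch. The summary does not say how overlapping frames are merged, so the triangular cross-fade below is an assumption, as are the window and overlap sizes. `relight_clip` is a hypothetical stand-in for the delight-then-relight model applied to one short clip.

```python
import numpy as np


def window_starts(num_frames, window, overlap):
    """Start indices of overlapping clips that together cover every frame."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final clip flush with the end if the regular stride falls short.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return starts


def relight_long_video(frames, relight_clip, window=14, overlap=4):
    """Run a clip-level model over a long video and blend the overlaps.

    frames: array of shape (T, H, W, C).
    relight_clip: hypothetical callable mapping a clip to a relit clip
        of the same shape.
    Blending is an assumed triangular cross-fade, not the paper's
    documented scheme.
    """
    frames = np.asarray(frames, dtype=np.float32)
    n = frames.shape[0]
    out = np.zeros_like(frames)
    weight = np.zeros((n,) + (1,) * (frames.ndim - 1), dtype=np.float32)
    for start in window_starts(n, window, overlap):
        clip = frames[start:start + window]
        result = np.asarray(relight_clip(clip), dtype=np.float32)
        m = clip.shape[0]
        # Weights rise toward the clip centre and fall toward its edges.
        w = np.minimum(np.arange(1, m + 1), np.arange(m, 0, -1))
        w = w.astype(np.float32).reshape((m,) + (1,) * (frames.ndim - 1))
        out[start:start + m] += w * result
        weight[start:start + m] += w
    return out / weight
```

Weighting frames near a clip's centre more heavily than frames at its edges is one common way to soften seams between clips. A real pipeline might instead condition each clip on frames already generated by the previous one.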
In practice
- Enhance under-exposed portraits by adding virtual fill lighting.
- Apply professional lighting setups (e.g., Rembrandt) post-capture.
- Unify inconsistent lighting across multiple actors in video scenes.
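Because lighting is controlled through an HDR environment map, a virtual fill light can be thought of as extra radiance added to that map before relighting. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it assumes an equirectangular map with +Y up and azimuth 0 at the image centre, and it models the fill light as a Gaussian lobe. The function name and all parameters are hypothetical.

```python
import numpy as np


def add_fill_light(env_map, azimuth_deg, elevation_deg,
                   intensity=2.0, spread_deg=30.0, color=(1.0, 1.0, 1.0)):
    """Return a copy of an equirectangular HDR map with a soft light added.

    env_map: float array of shape (H, W, 3), linear radiance.
    Assumed convention: row 0 is straight up (+Y), the centre column
    is azimuth 0.
    """
    env = np.asarray(env_map, dtype=np.float32)
    h, w, _ = env.shape
    # Spherical angles at pixel centres.
    theta = (np.arange(h) + 0.5) / h * np.pi                # polar angle from +Y
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi    # azimuth
    phi, theta = np.meshgrid(phi, theta)
    dirs = np.stack([np.sin(theta) * np.sin(phi),
                     np.cos(theta),
                     np.sin(theta) * np.cos(phi)], axis=-1)
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    light = np.array([np.cos(el) * np.sin(az),
                      np.sin(el),
                      np.cos(el) * np.cos(az)])
    # Angular distance of every pixel direction from the light direction.
    angle = np.arccos(np.clip(dirs @ light, -1.0, 1.0))
    lobe = np.exp(-0.5 * (angle / np.deg2rad(spread_deg)) ** 2)
    fill = intensity * lobe[..., None] * np.asarray(color, dtype=np.float32)
    return (env + fill).astype(np.float32)
```

The edited map would then be passed to the relight model in place of the original lighting. A wide `spread_deg` mimics a large soft source, while a narrow one behaves more like a hard key light.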
Topics
- Video Relighting
- Diffusion Models
- Portrait Photography
- LED Lighting Systems
- Hybrid Datasets
- Temporal Consistency
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.