Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Graphics, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

The "Pixel Cube" project introduces a diffusion-based method for photorealistic, temporally consistent relighting of dynamic portrait videos. It trains on a hybrid dataset that combines real-captured and rendered dynamic portrait videos covering diverse subjects, motions, and known lighting conditions. To capture the real data, the authors built a custom 2 m × 2 m × 2 m LED-based "Pixel Cube" lighting system with 90 high-resolution LED panels and 1,900 nits peak luminance, used for realistic lighting emulation and 60 FPS data acquisition. They fine-tune a pre-trained video diffusion model (Stable Video Diffusion), using per-frame HDR environment maps and synthesized background images for lighting control. The resulting model is reported to achieve state-of-the-art identity preservation, lighting harmony, and temporal consistency, and to generalize well to unseen data.

Key takeaway

For computer vision engineers building video processing tools, this research offers a framework for high-fidelity portrait video relighting. Consider a hybrid data acquisition strategy, potentially using an LED stage like the Pixel Cube, to generate diverse datasets paired with ground-truth lighting. A delight-then-relight diffusion pipeline with explicit lighting controls can improve photorealism and temporal consistency in applications ranging from consumer photo enhancement to unifying lighting across cinematic shots.

Key insights

Diffusion models, trained on hybrid real/synthetic data from a specialized LED stage, enable photorealistic, temporally consistent portrait video relighting.

Principles

High-fidelity hybrid datasets are crucial for diffusion model fine-tuning.
Decomposing relighting into delight and relight steps simplifies the problem.
Explicit environment map and background control enhance lighting consistency.

Method

A delight model first estimates flat-lit albedo from the input video. A relight model then generates the relit video from three inputs: that albedo, an HDR environment map that is encoded and injected via cross-attention, and a background image. Long videos are handled with an overlapping inference scheme.
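
The summary does not spell out the overlapping inference scheme, so the following is only a minimal sketch of one common way to do it: run the model on fixed-length windows that share a few frames, then cross-fade the shared frames. The names window_starts, relight_long_video, and relight_window are hypothetical, and the window length, overlap, and triangular blend weights are illustrative assumptions rather than the paper's settings.

```python
import numpy as np


def window_starts(num_frames, window, overlap):
    """Start indices of overlapping windows that together cover every frame."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # last window ends exactly on the last frame
    return starts


def relight_long_video(frames, relight_window, window=14, overlap=4):
    """Relight a long clip window by window and blend the overlapping frames.

    frames:         array of shape (T, H, W, C)
    relight_window: hypothetical callable mapping a (t, H, W, C) chunk to a
                    relit chunk of the same shape (stands in for the model)
    window/overlap: assumed values, not taken from the paper
    """
    frames = np.asarray(frames, dtype=np.float32)
    n = frames.shape[0]
    per_frame = (1,) * (frames.ndim - 1)  # broadcast shape for per-frame weights
    out = np.zeros_like(frames)
    weight_sum = np.zeros((n,) + per_frame, dtype=np.float32)

    for start in window_starts(n, window, overlap):
        chunk = frames[start:start + window]
        result = np.asarray(relight_window(chunk), dtype=np.float32)
        length = chunk.shape[0]
        # Triangular weights: frames near the window centre count more than
        # frames near its edges, which gives a smooth cross-fade in overlaps.
        ramp = np.minimum(np.arange(1, length + 1), np.arange(length, 0, -1))
        w = ramp.astype(np.float32).reshape((length,) + per_frame)
        out[start:start + length] += w * result
        weight_sum[start:start + length] += w

    return out / weight_sum
```

With an identity function in place of the model, relight_long_video returns the input video unchanged (up to floating-point error), which is a quick sanity check that the blend weights are normalized correctly.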

In practice

Enhance under-exposed portraits by adding virtual fill lighting.
Apply professional lighting setups (e.g., Rembrandt) post-capture.
Unify inconsistent lighting across multiple actors in video scenes.
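
Because the relight model is conditioned on an HDR environment map, one plausible way to request a virtual fill light is to edit that map before inference. The sketch below adds a soft lobe of radiance to an equirectangular map, relying only on the fact that light adds linearly in linear HDR space. The function name add_fill_light, the coordinate convention (y up, azimuth 0 at the image centre), and the intensity and sharpness defaults are all assumptions made for illustration; the paper's own environment-map format may differ.

```python
import numpy as np


def add_fill_light(env_map, azimuth_deg, elevation_deg, intensity=5.0, sharpness=20.0):
    """Add a spherical-Gaussian lobe to an equirectangular HDR map.

    env_map: array of shape (H, W, 3) holding linear radiance.
    Assumed convention: top row is elevation +90 degrees, bottom row is -90,
    columns span azimuth -180..180 degrees from left to right.
    """
    env = np.asarray(env_map, dtype=np.float32)
    h, w, _ = env.shape

    # Spherical angles at pixel centres.
    elev = np.deg2rad(90.0 - (np.arange(h) + 0.5) * 180.0 / h)
    azim = np.deg2rad((np.arange(w) + 0.5) * 360.0 / w - 180.0)
    elev, azim = np.meshgrid(elev, azim, indexing="ij")

    # Unit direction for every pixel (y is up).
    dirs = np.stack(
        [np.cos(elev) * np.sin(azim), np.sin(elev), np.cos(elev) * np.cos(azim)],
        axis=-1,
    )

    # Unit direction of the requested fill light.
    e0, a0 = np.deg2rad(elevation_deg), np.deg2rad(azimuth_deg)
    light = np.array([np.cos(e0) * np.sin(a0), np.sin(e0), np.cos(e0) * np.cos(a0)])

    # Spherical Gaussian: peaks at `intensity` along `light`, falls off smoothly.
    lobe = intensity * np.exp(sharpness * (dirs @ light - 1.0))

    # Radiance is additive, so the fill is simply summed onto the existing map.
    return (env + lobe[..., None]).astype(np.float32)
```

A lower sharpness value gives a broader, softer fill, and a higher one approximates a small key-style source. The edited map would then be passed to the relight stage in place of the original.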

Topics

Video Relighting
Diffusion Models
Portrait Photography
LED Lighting Systems
Hybrid Datasets
Temporal Consistency

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.