Video Reconstruction using Diffusion-based Image-to-Video Generation with Trajectory Guidance

2026-04-01 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision · Depth: Expert, extended

Summary

A new pipeline uses GPS telemetry to reconstruct missing frames in top-down drone video of autonomous surface vehicles (ASVs) performing maritime maneuvers. It converts raw GPS coordinates and a single reference frame into a trajectory-guided video sequence using SG-I2V, a pre-trained image-to-video diffusion model, with no domain-specific fine-tuning. GPS coordinates are projected into image space via an equirectangular mapping, producing per-vessel motion cues that condition the diffusion model. Evaluated against ground-truth video, the SG-I2V pipeline produced the most natural-looking frames (BRISQUE 25.52 vs. 23.64 for ground truth), the most realistic motion magnitude (temporal smoothness 1.14 vs. 1.42 for ground truth), and the strongest GPS trajectory adherence (9.31 px vs. 28.70 px for ground truth). It outperformed the optical flow extrapolation and RIFE interpolation baselines in challenging low-texture, small-object conditions.
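The summary does not say how trajectory adherence is computed. One plausible reading, shown here only as a minimal sketch, is the mean Euclidean distance in pixels between each vessel's observed position in a frame and its GPS-projected position. The function name and the input format (lists of (x, y) pixel pairs) are assumptions for illustration, not the paper's definition.

```python
import math


def mean_trajectory_error_px(observed, expected):
    """Mean Euclidean distance, in pixels, between paired (x, y) points.

    Assumed metric definition: `observed` holds vessel positions found in the
    video frames, `expected` holds the GPS-projected positions for the same
    frames, in the same order.
    """
    if not observed or len(observed) != len(expected):
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(o, e) for o, e in zip(observed, expected))
    return total / len(observed)
```

Under this reading, a lower value means the generated vessels stay closer to the positions implied by the telemetry.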

Key takeaway

For research scientists working on video reconstruction in challenging environments such as maritime surveillance, this work shows that combining GPS telemetry with an image-to-video diffusion model is a workable way to synthesize missing frames. Consider feeding auxiliary sensor data to the generator as explicit motion cues when visual signals alone are insufficient: in the reported evaluation, doing so improved frame naturalness, motion realism, and trajectory adherence relative to the optical flow extrapolation and RIFE interpolation baselines.

Key insights

Trajectory-guided diffusion models can reconstruct missing video frames by integrating external sensor data.

Principles

Auxiliary sensor data enhances visual synthesis.
Diffusion models generalize without fine-tuning.
Spatial and temporal coherence are critical for video.

Method

The pipeline involves GPS-to-pixel mapping, bounding-box initialization, and trajectory-conditioned video generation using SG-I2V, followed by quantitative evaluation against ground truth.
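The GPS-to-pixel step can be illustrated with a local equirectangular approximation. The sketch below assumes a north-up, nadir (straight-down) image centred on a known reference coordinate with a known ground resolution; those assumptions, the function name, and every parameter are hypothetical and are not taken from the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def gps_to_pixel(lat, lon, ref_lat, ref_lon, metres_per_pixel, image_size):
    """Project a GPS fix to (x, y) pixel coordinates.

    Assumes the image is north-up and centred on (ref_lat, ref_lon), and that
    the scene is small enough for an equirectangular approximation to hold.
    """
    width, height = image_size
    # East-west distance shrinks with the cosine of the reference latitude.
    east_m = (math.radians(lon - ref_lon) * EARTH_RADIUS_M
              * math.cos(math.radians(ref_lat)))
    north_m = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    x = width / 2 + east_m / metres_per_pixel
    y = height / 2 - north_m / metres_per_pixel  # image y grows downward
    return x, y
```

A real drone setup would also need to account for camera heading, altitude changes, and lens distortion, none of which this sketch models.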

In practice

Use GPS telemetry to guide video reconstruction.
Project real-world coordinates into image space.
Employ pre-trained diffusion models for synthesis.
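As a rough illustration of turning projected positions into motion cues, the sketch below resamples a vessel's pixel track to one point per output frame and slides the first-frame bounding box along it. The helper names and the (x_min, y_min, x_max, y_max) box format are hypothetical; the input format SG-I2V actually expects is not described here and may differ.

```python
def resample_track(track_px, num_frames):
    """Linearly interpolate a pixel track to exactly `num_frames` points."""
    if num_frames < 2 or len(track_px) < 2:
        raise ValueError("need at least two track points and two frames")
    last = len(track_px) - 1
    out = []
    for i in range(num_frames):
        t = i * last / (num_frames - 1)   # fractional index into the track
        k = min(int(t), last - 1)         # segment start, clamped at the end
        frac = t - k
        (xa, ya), (xb, yb) = track_px[k], track_px[k + 1]
        out.append((xa + frac * (xb - xa), ya + frac * (yb - ya)))
    return out


def boxes_along_track(track_px, first_box):
    """Keep the first-frame box size and centre it on each track point."""
    x_min, y_min, x_max, y_max = first_box
    half_w, half_h = (x_max - x_min) / 2, (y_max - y_min) / 2
    return [(cx - half_w, cy - half_h, cx + half_w, cy + half_h)
            for cx, cy in track_px]
```

Keeping the box size fixed is a simplification that suits a top-down view at roughly constant altitude; it ignores vessel rotation.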

Topics

Video Reconstruction
Diffusion Models
Image-to-Video Generation
Trajectory Guidance
Autonomous Surface Vehicles

Code references

hzwer/Practical-RIFE

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.