The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

A study investigates whether modern video diffusion models internally encode physical structure rather than merely reproducing motion patterns. The researchers probe these models by approximately inverting the deterministic sampling process: they integrate the learned velocity field backward from a clean video latent to noise. This gives access to the model's intermediate states and attention maps along latent trajectories that correspond to real videos with known physical plausibility. The findings show that physical plausibility is linearly decodable from diffusion transformer states on the IntPhys and InfLevel datasets, with an average accuracy of 81.27%, which surpasses dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges within the denoising transformer itself, even though the model is not trained with a self-supervised predictive objective. This suggests that physically meaningful representations can arise as a byproduct of generative denoising.
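To make "linearly decodable" concrete, here is a minimal sketch of a linear probe: a logistic regression trained on frozen features to predict a binary plausible/implausible label. Everything in it is an assumption for illustration. The features are synthetic stand-ins for pooled transformer states, the labels come from a planted linear direction, and the function names `fit_linear_probe` and `probe_accuracy` are hypothetical. It does not reproduce the study's data, features, or reported accuracy.

```python
import numpy as np

# Synthetic stand-in for pooled transformer states (hypothetical shapes).
rng = np.random.default_rng(0)
n, d = 400, 16
features = rng.normal(size=(n, d))

# Planted linear direction plus small noise defines the binary label
# (1 = "plausible", 0 = "implausible" in this toy setup).
w_true = rng.normal(size=d)
labels = (features @ w_true + 0.1 * rng.normal(size=n) > 0).astype(np.float64)

X_train, y_train = features[:300], labels[:300]
X_test, y_test = features[300:], labels[300:]


def fit_linear_probe(X, y, lr=0.5, steps=500):
    """Fit logistic regression with plain gradient descent; features stay frozen."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of label 1
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose thresholded logit matches the label."""
    return float(((X @ w + b > 0).astype(np.float64) == y).mean())


w, b = fit_linear_probe(X_train, y_train)
test_acc = probe_accuracy(X_test, y_test, w, b)
print(f"held-out probe accuracy: {test_acc:.3f}")
```

A probe is kept linear on purpose: if a single linear readout separates the classes, the information is already organized in the representation and is not being computed by the probe itself.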

Key takeaway

For AI Scientists and Research Scientists exploring generative model capabilities, this work indicates that video diffusion models implicitly encode physical plausibility. Consider these models not just as pattern generators but as potential world simulators with emergent physical understanding. The result points to new avenues for designing models that use generative denoising to build robust, physically aware representations, which may improve performance on tasks that require complex physical reasoning or predictive modeling.

Key insights

Video diffusion models implicitly encode physical plausibility as a byproduct of generative denoising, and linear probes on their transformer states outperform dedicated representation learners such as V-JEPA and VideoMAE.

Method

The method approximately inverts the deterministic sampling process: it integrates the learned velocity field backward from a clean video latent to noise, which exposes the intermediate states along that trajectory for probing.
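The inversion step can be sketched with a simple Euler integrator. This is a minimal illustration under stated assumptions, not the study's implementation: `toy_velocity` is a hypothetical stand-in for the learned velocity field, the time convention (t = 0 for the clean latent, t = 1 for noise) is assumed, and the latent shape is arbitrary. In a real model, transformer activations and attention maps would be recorded at each step; here only the latent states are kept.

```python
import numpy as np


def toy_velocity(x, t):
    """Hypothetical stand-in for a learned velocity field v(x, t)."""
    return x * (1.0 + 0.5 * t)


def invert_to_noise(x_clean, velocity_fn, num_steps=50):
    """Euler-integrate from t=0 (clean latent) to t=1 (noise), keeping every state."""
    dt = 1.0 / num_steps
    x = np.array(x_clean, dtype=np.float64)  # np.array copies its input
    trajectory = [(0.0, x.copy())]
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
        trajectory.append((t + dt, x.copy()))  # intermediate state available for probing
    return trajectory


def sample_from_noise(x_noise, velocity_fn, num_steps=50):
    """Run the deterministic sampler in the usual direction, from t=1 back to t=0."""
    dt = 1.0 / num_steps
    x = np.array(x_noise, dtype=np.float64)
    for i in range(num_steps, 0, -1):
        x = x - dt * velocity_fn(x, i * dt)
    return x


rng = np.random.default_rng(0)
x_clean = rng.normal(size=(4, 8))  # assumed latent shape, for illustration only

trajectory = invert_to_noise(x_clean, toy_velocity)
t_end, x_noise = trajectory[-1]

# Re-sampling from the inverted endpoint only approximately recovers the input,
# because each Euler step carries discretization error.
x_rec = sample_from_noise(x_noise, toy_velocity)
rel_err = np.linalg.norm(x_rec - x_clean) / np.linalg.norm(x_clean)
print(f"states kept: {len(trajectory)}, round-trip relative error: {rel_err:.3f}")
```

The round trip is close but not exact, which is one reason the inversion is described as approximate. Using more steps or a higher-order integrator would reduce the error at higher compute cost.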

Best for: AI Scientist, Research Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.