Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A recent study investigates the encoding of intuitive physics information within frozen representations of pretrained video foundation models, analyzing variations across model families, layers, and probe types. Researchers employed frozen-feature probing on the IntPhys2 and Minimal Video Pairs (MVP) benchmarks, comparing predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and the diffusion-based video generator LTX-Video. V-JEPA demonstrated the strongest overall performance, particularly with probes designed for temporal dynamics, while VideoMAE remained competitive. LTX-Video exhibited weaker but still discernible signals. Layerwise analysis revealed that physics-relevant information is least accessible in early layers, becoming most prominent at intermediate-to-late depths. Furthermore, disrupting frame order significantly reduced performance, especially on MVP. These findings indicate that intuitive-physics knowledge reliably emerges in pretrained video representations, with its accessibility strongly influenced by the pretraining paradigm, representational depth, and readout mechanism.

Key takeaway

Machine learning engineers building video understanding systems can expect frozen features from pretrained video foundation models such as V-JEPA and VideoMAE to carry intuitive-physics information. Favor models and probes that capture temporal dynamics, and read features from intermediate-to-late layers rather than early ones, where physics-relevant information is least accessible. These findings can guide model selection and feature extraction for tasks that require real-world physical reasoning.

Key insights

Pretrained video foundation models reliably encode intuitive physics, with accessibility varying by model, layer depth, and probing method.

Principles

Intuitive physics knowledge emerges in video representations.
Accessibility depends on pretraining paradigm.
Physics-relevant information is least accessible in early layers and most prominent at intermediate-to-late depths.

Method

Frozen-feature probing was applied to IntPhys2 and Minimal Video Pairs (MVP) benchmarks, comparing V-JEPA, VideoMAE, and LTX-Video models to assess intuitive physics encoding.
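
As a rough illustration of frozen-feature probing, the sketch below fits a small linear probe on fixed clip embeddings and scores it on held-out clips. Everything in it is an assumption made for illustration: the features are synthetic stand-ins rather than real V-JEPA, VideoMAE, or LTX-Video activations, the binary labels stand in for a plausible-versus-implausible judgment, and the ridge-regression probe and the names `fit_linear_probe` and `probe_accuracy` are hypothetical, not the study's probe design.

```python
import numpy as np

def fit_linear_probe(features, labels, l2=1e-2):
    """Fit a ridge-regularised linear probe in closed form.

    features: (n_clips, dim) frozen, pooled clip embeddings (never updated).
    labels:   (n_clips,) array of 0/1 labels.
    """
    x = np.hstack([features, np.ones((features.shape[0], 1))])  # bias column
    y = 2.0 * labels - 1.0                                      # {0,1} -> {-1,+1}
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def probe_accuracy(weights, features, labels):
    """Held-out accuracy of the probe: sign of the linear score vs. label."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    preds = (x @ weights > 0).astype(int)
    return float((preds == labels).mean())

# Synthetic stand-in for frozen features: the label shifts each embedding
# along one hidden direction, so a linear readout can recover it.
rng = np.random.default_rng(0)
n_clips, dim = 400, 32
labels = rng.integers(0, 2, size=n_clips)
direction = rng.normal(size=dim)
features = rng.normal(size=(n_clips, dim)) + 0.5 * np.outer(2.0 * labels - 1.0, direction)

# Only the probe is trained; the "backbone" features stay fixed.
weights = fit_linear_probe(features[:300], labels[:300])
test_acc = probe_accuracy(weights, features[300:], labels[300:])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

With a real model, `features` would be replaced by activations extracted once from the frozen backbone; the probe is kept deliberately small so that its accuracy reflects what the representation already makes accessible.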

In practice

Prefer V-JEPA when intuitive-physics signal matters most; VideoMAE remains a competitive alternative.
Focus on intermediate-to-late layers for physics signals.
Consider temporal dynamics in video model evaluation.
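
The recommendations above can be sketched as a layer sweep paired with a frame-order control: probe each layer with an order-sensitive readout, then repeat with frames shuffled to see how much of the accuracy depends on temporal structure. The sketch is a toy under stated assumptions: the per-layer features are synthetic, the signal strengths in `layer_strength` are made up, and `temporal_readout`, `shuffle_frames`, and `probe_accuracy` are hypothetical helpers, not the study's probes or results.

```python
import numpy as np

def temporal_readout(frames):
    """Order-sensitive readout: mean frame-to-frame feature difference.

    frames: (n_clips, n_frames, dim) frozen per-frame features from one layer.
    """
    return np.diff(frames, axis=1).mean(axis=1)

def shuffle_frames(frames, rng):
    """Permute frame order independently within each clip."""
    return np.stack([clip[rng.permutation(clip.shape[0])] for clip in frames])

def with_bias(x):
    return np.hstack([x, np.ones((len(x), 1))])

def probe_accuracy(train_x, train_y, test_x, test_y):
    """Least-squares linear probe; returns held-out accuracy."""
    w, *_ = np.linalg.lstsq(with_bias(train_x), 2.0 * train_y - 1.0, rcond=None)
    preds = (with_bias(test_x) @ w > 0).astype(int)
    return float((preds == test_y).mean())

rng = np.random.default_rng(0)
n_clips, n_frames, dim, split = 300, 8, 16, 200
labels = rng.integers(0, 2, size=n_clips)
sign = 2.0 * labels - 1.0
direction = rng.normal(size=dim)
time = np.linspace(0.0, 1.0, n_frames)

# Made-up signal strength per layer: the label sets the direction in which
# features drift over time, weakest early and strongest at mid depth.
layer_strength = {"layer_0": 0.0, "layer_1": 0.3, "layer_2": 2.0, "layer_3": 0.6}
drift = sign[:, None, None] * time[None, :, None] * direction[None, None, :]

results = {}
for name, strength in layer_strength.items():
    frames = rng.normal(size=(n_clips, n_frames, dim)) + strength * drift
    ordered = temporal_readout(frames)
    shuffled = temporal_readout(shuffle_frames(frames, rng))
    results[name] = (
        probe_accuracy(ordered[:split], labels[:split], ordered[split:], labels[split:]),
        probe_accuracy(shuffled[:split], labels[:split], shuffled[split:], labels[split:]),
    )

best_layer = max(results, key=lambda k: results[k][0])
for name, (ordered_acc, shuffled_acc) in results.items():
    print(f"{name}: ordered={ordered_acc:.2f} shuffled={shuffled_acc:.2f}")
print("best layer by ordered accuracy:", best_layer)
```

Selecting the layer on held-out accuracy and reporting the shuffled score alongside it keeps the two questions separate: where the information is most accessible, and whether the readout relies on frame order at all.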

Topics

Video Foundation Models
Intuitive Physics
V-JEPA
VideoMAE
Representation Learning
Temporal Dynamics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.