Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
Summary
A recent study investigates the encoding of intuitive physics information within frozen representations of pretrained video foundation models, analyzing variations across model families, layers, and probe types. Researchers employed frozen-feature probing on the IntPhys2 and Minimal Video Pairs (MVP) benchmarks, comparing predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and the diffusion-based video generator LTX-Video. V-JEPA demonstrated the strongest overall performance, particularly with probes designed for temporal dynamics, while VideoMAE remained competitive. LTX-Video exhibited weaker but still discernible signals. Layerwise analysis revealed that physics-relevant information is least accessible in early layers, becoming most prominent at intermediate-to-late depths. Furthermore, disrupting frame order significantly reduced performance, especially on MVP. These findings indicate that intuitive-physics knowledge reliably emerges in pretrained video representations, with its accessibility strongly influenced by the pretraining paradigm, representational depth, and readout mechanism.
Key takeaway
Machine learning engineers building video understanding systems can treat pretrained video foundation models such as V-JEPA and VideoMAE as already encoding intuitive-physics information in their frozen features. Favor models and readouts that capture temporal dynamics, and extract features from intermediate-to-late layers, where physics-relevant information is most accessible. These findings can guide model selection and feature extraction for tasks that require physical reasoning.
Key insights
Pretrained video foundation models reliably encode intuitive physics, with accessibility varying by model, layer depth, and probing method.
Principles
- Intuitive physics knowledge emerges in video representations.
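- Accessibility depends on the pretraining paradigm and on the probe used to read the features out.
- Physics-relevant information is least accessible in early layers and most accessible at intermediate-to-late depths.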
Method
Frozen-feature probing was applied to IntPhys2 and Minimal Video Pairs (MVP) benchmarks, comparing V-JEPA, VideoMAE, and LTX-Video models to assess intuitive physics encoding.
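As a rough illustration of layerwise frozen-feature probing, the sketch below fits one small linear probe per layer on fixed features and compares held-out accuracy. It is not the study's pipeline: the closed-form ridge probe, the function names, and the synthetic "early" and "late" feature arrays are all assumptions standing in for activations extracted from a real frozen model.

```python
import numpy as np

def fit_ridge_probe(X, y, l2=1e-2):
    """Closed-form ridge probe; binary labels y in {0, 1} are mapped to {-1, +1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    targets = 2.0 * y - 1.0
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ targets)

def probe_accuracy(w, X, y):
    """Accuracy of the linear probe w on features X with labels y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    pred = (Xb @ w > 0).astype(int)
    return float((pred == y).mean())

def layerwise_probe(features_by_layer, y, train_frac=0.7, seed=0):
    """Train one probe per layer on a shared split; the backbone is never updated."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, test = idx[:n_train], idx[n_train:]
    scores = {}
    for layer, X in features_by_layer.items():
        w = fit_ridge_probe(X[train], y[train])
        scores[layer] = probe_accuracy(w, X[test], y[test])
    return scores

# Synthetic stand-ins for frozen features (hypothetical, not real model output):
# "early" carries no label signal, "late" carries signal in its first 4 dimensions.
rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n)                # e.g. plausible vs. implausible clip
signal = (2.0 * y - 1.0)[:, None]
features_by_layer = {
    "early": rng.normal(size=(n, d)),
    "late": rng.normal(size=(n, d)) + 1.5 * signal * (np.arange(d) < 4),
}
scores = layerwise_probe(features_by_layer, y)
print(scores)
```

With real models, each entry of `features_by_layer` would hold pooled activations from one layer, and the per-layer scores trace where the information becomes linearly readable.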
In practice
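- Prefer V-JEPA when physics-relevant features matter most; it performed strongest overall, with VideoMAE competitive.
- Extract features from intermediate-to-late layers for physics signals.
- Include a frame-order control (for example, shuffled frames) when evaluating video models, since disrupting frame order significantly reduced performance, especially on MVP.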
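The summary reports that disrupting frame order reduced probe performance. A minimal sketch of such a control is shown below, assuming per-frame features of shape `(T, D)`. The two readouts (`mean_pool` and `temporal_delta`) and the synthetic ramp clip are hypothetical illustrations, not the probes used in the study; they only show why an order-invariant readout cannot detect shuffling while an order-sensitive one can.

```python
import numpy as np

def shuffle_frames(feats, rng):
    """Return a copy of feats (T, D) with the frame axis randomly permuted."""
    perm = rng.permutation(feats.shape[0])
    return feats[perm]

def mean_pool(feats):
    """Order-invariant readout: average over time."""
    return feats.mean(axis=0)

def temporal_delta(feats):
    """Order-sensitive readout: mean absolute frame-to-frame change."""
    return np.abs(np.diff(feats, axis=0)).mean(axis=0)

# Synthetic clip (an assumption): features drift smoothly along a fixed direction,
# loosely mimicking an object moving at constant velocity.
T, D = 16, 8
direction = np.linspace(0.5, 1.5, D)
clip = np.arange(T)[:, None] * direction[None, :]

rng = np.random.default_rng(0)
shuffled = shuffle_frames(clip, rng)

same_mean = np.allclose(mean_pool(clip), mean_pool(shuffled))
delta_original = temporal_delta(clip).mean()
delta_shuffled = temporal_delta(shuffled).mean()
print(same_mean, delta_original, delta_shuffled)
```

If a probe scores the same on ordered and shuffled clips, it is likely relying on appearance rather than dynamics, which is the kind of shortcut a frame-order control is meant to expose.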
Topics
- Video Foundation Models
- Intuitive Physics
- V-JEPA
- VideoMAE
- Representation Learning
- Temporal Dynamics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.