Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs
Summary
CausalPhys is a new benchmark of over 3,000 expert-curated video- and image-based questions designed to evaluate the causal physical reasoning of vision-language models (VLMs). It spans four domains (Perception, Anticipation, Intervention, and Goal Orientation) and 16 subcategories, and each question includes an expert-annotated causal graph capturing object–attribute–event dependencies. These graphs support a novel causal-graph-grounded metric that provides interpretable, fine-grained evaluation beyond answer-only accuracy and helps diagnose where VLMs fail at causal understanding. Analysis of leading VLMs on CausalPhys reveals systematic gaps in capturing causal dependencies, particularly a persistent disparity between entity recognition and relational reasoning. To address this, the authors propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures and is reported to substantially improve both reasoning accuracy and interpretability across multiple model backbones. The benchmark and code are publicly available.
Key takeaway
If you are an AI scientist or machine learning engineer developing embodied AI, integrate causally-informed benchmarks like CausalPhys to diagnose VLM reasoning failures that answer-only accuracy hides. Your models may struggle with relational understanding despite strong entity recognition. Consider applying Causal Rationale-informed Fine-Tuning (CRFT) to explicitly align your VLM's reasoning with causal structures, which the authors report improves both accuracy and interpretability. Both steps support building more reliable and trustworthy AI systems for dynamic physical environments.
Key insights
VLMs struggle with causal physical reasoning, necessitating benchmarks with explicit causal graphs and causality-aware training.
Principles
- Causal graphs enable mechanism-level VLM evaluation.
- Scaling parameters alone does not ensure causal generalization.
- Entity recognition does not imply relational understanding.
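The gap between entity recognition and relational understanding can be made concrete with a small scoring sketch. The summary does not give the benchmark's actual metric, so everything here is an illustrative assumption: the `CausalGraph` container, the use of recall, and the toy ramp-and-cup example are hypothetical and are not taken from the CausalPhys code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Edge = Tuple[str, str]  # (cause, effect); hypothetical representation


@dataclass(frozen=True)
class CausalGraph:
    """Hypothetical container for a causal graph over entities and events."""
    nodes: FrozenSet[str]
    edges: FrozenSet[Edge]


def recall(predicted: frozenset, reference: frozenset) -> float:
    """Fraction of reference items that also appear in the prediction."""
    if not reference:
        return 1.0
    return len(predicted & reference) / len(reference)


def graph_grounded_scores(pred: CausalGraph, gold: CausalGraph) -> dict:
    """Score entities (nodes) and causal relations (edges) separately.

    Assumption: simple recall against the annotated graph; the benchmark's
    real metric may be defined differently.
    """
    return {
        "node_recall": recall(pred.nodes, gold.nodes),
        "edge_recall": recall(pred.edges, gold.edges),
    }


# Toy annotated graph: a ball on a ramp rolls and knocks over a cup.
gold = CausalGraph(
    nodes=frozenset({"ball", "ramp", "ball rolls", "cup falls"}),
    edges=frozenset({
        ("ramp", "ball rolls"),
        ("ball", "ball rolls"),
        ("ball rolls", "cup falls"),
    }),
)

# Toy model output: every entity is named, but only one causal link is stated.
pred = CausalGraph(
    nodes=frozenset({"ball", "ramp", "ball rolls", "cup falls"}),
    edges=frozenset({("ball rolls", "cup falls")}),
)

scores = graph_grounded_scores(pred, gold)
print(scores)
```

In this toy case the model names every entity but recovers only one of the three causal links, which is the kind of discrepancy that answer-only accuracy cannot surface.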
Method
Causal Rationale-informed Fine-Tuning (CRFT) aligns VLM reasoning with expert-annotated causal graphs using a mixed rationale- and answer-level loss.
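One way to picture a mixed rationale- and answer-level loss is as a weighted sum of two token-level negative log-likelihood terms. The summary does not specify the exact form CRFT uses, so the convex combination, the weight `alpha`, and the function names below are assumptions made for illustration only.

```python
import math
from typing import Sequence


def mean_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference tokens.

    Each entry is the probability the model assigned to the correct token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def mixed_loss(
    rationale_probs: Sequence[float],
    answer_probs: Sequence[float],
    alpha: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> float:
    """Weighted sum of rationale-level and answer-level losses (a sketch)."""
    rationale_term = mean_nll(rationale_probs)  # causal rationale tokens
    answer_term = mean_nll(answer_probs)        # final answer tokens
    return alpha * rationale_term + (1.0 - alpha) * answer_term


# Toy probabilities: a confident answer but a poorly modeled rationale.
loss = mixed_loss(rationale_probs=[0.4, 0.5, 0.3], answer_probs=[0.9], alpha=0.5)
print(round(loss, 4))
```

Under this sketch, setting `alpha` to 0 reduces training to ordinary answer-only supervision, while larger values put more pressure on reproducing the annotated causal rationale.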
In practice
- Use CausalPhys to diagnose VLM causal failures.
- Apply CRFT to improve VLM physical reasoning.
- Evaluate VLMs beyond answer-only accuracy.
Topics
- Vision-Language Models
- Causal Reasoning
- Physical Reasoning
- CausalPhys Benchmark
- Causal Rationale Fine-Tuning
- Embodied AI
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.