Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

2025-09-25 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

CausalPhys is a new benchmark of over 3,000 expert-curated video- and image-based questions designed to evaluate causal physical reasoning in vision-language models (VLMs). It spans four domains (Perception, Anticipation, Intervention, and Goal Orientation) and 16 subcategories, and each question includes an expert-annotated causal graph capturing object–attribute–event dependencies. These graphs enable a novel causal-graph-grounded metric that gives interpretable, fine-grained evaluation beyond answer-only accuracy and helps diagnose where VLMs fail at causal understanding. Analysis of leading VLMs on CausalPhys reveals systematic gaps in capturing causal dependencies, particularly a persistent disparity between entity recognition and relational reasoning. To address this, the authors propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures; they report substantial improvements in both reasoning accuracy and interpretability across multiple model backbones. The benchmark and code are publicly available.

Key takeaway

If you are an AI scientist or machine learning engineer developing embodied AI, integrate causally-informed benchmarks like CausalPhys to diagnose VLM reasoning failures that answer-only accuracy hides. Your models likely struggle with relational understanding despite strong entity recognition. Consider applying Causal Rationale-informed Fine-Tuning (CRFT) to explicitly align your VLM's reasoning with causal structures; the authors report that this improves both accuracy and interpretability. Both steps support building more reliable and trustworthy AI systems for dynamic physical environments.

Key insights

VLMs struggle with causal physical reasoning, necessitating benchmarks with explicit causal graphs and causality-aware training.

Principles

Causal graphs enable mechanism-level VLM evaluation.
Scaling parameters alone does not ensure causal generalization.
Entity recognition does not imply relational understanding.
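
To make the first and third principles concrete, here is a minimal sketch of how a graph-grounded score could separate entity recognition from relational understanding. It is not the CausalPhys metric, whose exact definition this summary does not give. The `CausalGraph` type, the `graph_scores` function, and the choice of node-level and edge-level F1 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# Hypothetical representation: an edge is a directed (cause, effect) pair.
Edge = Tuple[str, str]


@dataclass(frozen=True)
class CausalGraph:
    """Illustrative causal graph; not the benchmark's annotation format."""
    nodes: FrozenSet[str]
    edges: FrozenSet[Edge]


def f1(predicted: frozenset, reference: frozenset) -> float:
    """Set-overlap F1 between predicted and reference items."""
    if not predicted and not reference:
        return 1.0  # nothing to find and nothing predicted
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def graph_scores(predicted: CausalGraph, reference: CausalGraph) -> Dict[str, float]:
    """Score entities (nodes) and causal relations (edges) separately."""
    return {
        "node_f1": f1(predicted.nodes, reference.nodes),
        "edge_f1": f1(predicted.edges, reference.edges),
    }


# Assumed toy example: the model names every entity and event correctly
# but reverses one causal direction.
reference = CausalGraph(
    nodes=frozenset({"ramp", "ball", "ball_rolls", "cup_falls"}),
    edges=frozenset({("ramp", "ball_rolls"), ("ball_rolls", "cup_falls")}),
)
predicted = CausalGraph(
    nodes=frozenset({"ramp", "ball", "ball_rolls", "cup_falls"}),
    edges=frozenset({("ramp", "ball_rolls"), ("cup_falls", "ball_rolls")}),
)
scores = graph_scores(predicted, reference)
```

In this toy case `scores["node_f1"]` is 1.0 while `scores["edge_f1"]` is 0.5: every entity is recognized, yet half of the causal relations are wrong. An answer-only accuracy figure would not expose that gap.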

Method

Causal Rationale-informed Fine-Tuning (CRFT) aligns VLM reasoning with expert-annotated causal graphs using a mixed rationale- and answer-level loss.
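
One way to read "mixed rationale- and answer-level loss" is as a weighted sum of two token-level likelihood terms. The sketch below assumes that reading. The convex combination, the `rationale_weight` parameter, and the function names are hypothetical and are not taken from the paper. Inputs are the probabilities the model assigns to each reference token.

```python
import math
from typing import Sequence


def mean_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def mixed_loss(
    rationale_probs: Sequence[float],
    answer_probs: Sequence[float],
    rationale_weight: float = 0.5,  # assumed hyperparameter, not from the paper
) -> float:
    """Convex combination of rationale-level and answer-level losses."""
    if not 0.0 <= rationale_weight <= 1.0:
        raise ValueError("rationale_weight must lie in [0, 1]")
    rationale_term = mean_nll(rationale_probs)  # tokens of the causal rationale
    answer_term = mean_nll(answer_probs)        # tokens of the final answer
    return rationale_weight * rationale_term + (1.0 - rationale_weight) * answer_term
```

Under this assumption, setting `rationale_weight` to 0 recovers ordinary answer-only fine-tuning, and raising it puts more pressure on the model to reproduce the annotated causal rationale.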

In practice

Use CausalPhys to diagnose VLM causal failures.
Apply CRFT to improve VLM physical reasoning.
Evaluate VLMs beyond answer-only accuracy.

Topics

Vision-Language Models
Causal Reasoning
Physical Reasoning
CausalPhys Benchmark
Causal Rationale-informed Fine-Tuning (CRFT)
Embodied AI

Code references

haorentang/CausalPhys

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.