Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

2026-06-24 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Computer Vision & Image Processing · Depth: Expert, medium

Summary

Physics Question Scene Graph (PQSG) is a hierarchical, question-based pipeline for evaluating the physical plausibility of videos produced by text-to-video (T2V) models. It targets two gaps: current T2V models often violate basic physical laws, and existing evaluation techniques are too coarse to show where they do so. PQSG uses a vision-language model (VLM), guided by in-context examples, to generate a graph-structured hierarchy of questions that check whether a video is faithful to the prompt's objects and actions and whether it obeys the relevant physical laws. The graph encodes logical dependencies between questions, which keeps each question contextually valid and lets the evaluation pinpoint specific physical plausibility violations. PQSG was validated on FinePhyEval, a new dataset of physics-based prompts, videos generated by models such as Sora 2, Veo 3, and Wan 2.1, and human annotations. PQSG correlated more strongly with human judgments than previous methods and ranked the closed-source models above Wan 2.1 in physical realism.

Key takeaway

For machine learning engineers developing or evaluating text-to-video models, PQSG offers a fine-grained way to assess physical plausibility that goes beyond subjective impressions. Question-based, graph-structured evaluation can pinpoint which physical laws a generated video violates, which helps guide targeted model improvements. It also gives a more objective basis for comparing models such as Sora 2 against others on how well their videos follow real-world physics.

Key insights

PQSG offers a fine-grained, graph-based question system to evaluate physical plausibility in text-to-video generation.

Principles

Graph-based questions ensure contextual validity.
Hierarchical evaluation localizes physical violations.
VLM-guided questions improve assessment granularity.

Method

PQSG uses a VLM, guided by in-context examples, to generate a hierarchical graph of questions, then uses those questions to check whether a video shows the prompt's objects and actions and obeys the relevant physical laws.
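
To make the dependency idea concrete, here is a minimal sketch of a hierarchical question graph in which a question is only asked when all of its parent questions were answered "yes". This is not the authors' implementation: the `Question` class, the three level names, the `evaluate` and `score` functions, the example prompt, and the rule that skipped questions count as failures are all assumptions made for illustration. In a real pipeline the `answer` callback would query a VLM about the video; here it is a hard-coded stub.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Question:
    """One node of a hypothetical question graph."""
    qid: str
    text: str
    level: str  # assumed levels: "object", "action", "physics"
    parents: List[str] = field(default_factory=list)


def evaluate(
    questions: List[Question],
    answer: Callable[[Question], bool],
) -> Dict[str, Optional[bool]]:
    """Answer questions in order, skipping any whose parents did not all pass.

    Assumes `questions` is listed parents-first (topological order).
    A skipped question is recorded as None.
    """
    results: Dict[str, Optional[bool]] = {}
    for q in questions:
        if all(results.get(p) is True for p in q.parents):
            results[q.qid] = answer(q)
        else:
            results[q.qid] = None  # not contextually valid, so not asked
    return results


def score(results: Dict[str, Optional[bool]]) -> float:
    """Fraction of questions answered yes; skipped ones count as failures (an assumption)."""
    return sum(1 for r in results.values() if r is True) / len(results)


# Hypothetical prompt: "A ball rolls off a table and falls to the floor."
demo_questions = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Is there a table?", "object"),
    Question("q3", "Does the ball roll off the table?", "action", ["q1", "q2"]),
    Question("q4", "Does the ball fall downward after leaving the table?",
             "physics", ["q3"]),
]

# Stub standing in for VLM answers about one generated video.
stub_answers = {"q1": True, "q2": True, "q3": True, "q4": False}

results = evaluate(demo_questions, lambda q: stub_answers[q.qid])
print(results)
print(score(results))
```

With these stub answers the object and action questions pass and only the physics question fails, so the failure is localized to the physical-law level rather than folded into a single overall score.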

In practice

Use PQSG for fine-grained T2V model evaluation.
Benchmark VLMs on question answering tasks.
Compare T2V models on physical realism.
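
When comparing an automatic metric against human annotations, one common choice is a rank correlation between per-video metric scores and human ratings. The sketch below computes Spearman's rank correlation in plain Python. The source does not state which correlation measure was used for PQSG, and the scores and ratings here are made-up placeholders, not FinePhyEval data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only.
metric_scores = [0.25, 0.75, 0.50, 1.00]  # hypothetical per-video metric scores
human_ratings = [1, 4, 2, 5]              # hypothetical human plausibility ratings

print(spearman(metric_scores, human_ratings))
```

A value near 1 means the metric orders videos the same way the annotators do; comparing this value across metrics is one way to judge which metric tracks human judgment more closely.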

Topics

Text-to-Video Generation
Physical Plausibility
Video Evaluation
Vision-Language Models
Scene Graph
FinePhyEval Dataset

Code references

Zeqing-Wang/PhyDetEx

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.