Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation
Summary
Physics Question Scene Graph (PQSG) is a hierarchical, question-based evaluation pipeline for assessing the physical plausibility of videos generated by text-to-video models. It targets two gaps: current models struggle to adhere to basic physical laws, and existing evaluation techniques lack granularity. PQSG uses a vision-language model (VLM), guided by in-context examples, to generate a graph-based hierarchy of questions that check a video's faithfulness at three levels: objects, actions, and adherence to physical laws. The graph representation encodes logical dependencies between questions, so each question is asked only where it is contextually valid, and the answers give a fine-grained picture of specific physical plausibility violations. The system was validated on FinePhyEval, a new dataset of physics-based prompts and human-annotated videos generated by models such as Sora 2, Veo 3, and Wan 2.1. PQSG showed higher correlation with human judgments than previous methods and ranked the closed-source models above Wan 2.1 in physical realism.
Key takeaway
For machine learning engineers developing or evaluating text-to-video models, PQSG provides a fine-grained, question-based method for assessing physical plausibility rather than relying on subjective evaluation. Its graph-structured questions pinpoint which physical laws a video violates, which can guide targeted model improvements. The same scores let you compare models such as Sora 2 against others on how well their videos adhere to real-world physics.
Key insights
PQSG offers a fine-grained, graph-based question system to evaluate physical plausibility in text-to-video generation.
Principles
- Graph-based questions ensure contextual validity.
- Hierarchical evaluation localizes physical violations.
- VLM-guided questions improve assessment granularity.
Method
PQSG uses a VLM to generate a hierarchical, graph-based question pipeline, guided by in-context examples, to evaluate video adherence to physical laws, objects, and actions.
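The dependency structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Question` class, the `evaluate_graph` and `plausibility_score` functions, the example questions, and the rule that a question is skipped when any parent is answered "no" are all assumptions made for this sketch. The `answer_fn` callback stands in for a VLM that answers a yes/no question about a video.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Question:
    qid: str
    text: str
    level: str  # "object", "action", or "physics"
    parents: List[str] = field(default_factory=list)  # qids this question depends on


def evaluate_graph(
    questions: List[Question], answer_fn: Callable[[Question], bool]
) -> Dict[str, str]:
    """Answer each question only if all of its parents were answered "yes".

    Assumes the dependencies form a DAG (no cycles). Returns a status per
    question: "yes", "no", or "skipped" (a parent failed or was skipped).
    """
    by_id = {q.qid: q for q in questions}
    status: Dict[str, str] = {}

    def visit(qid: str) -> str:
        if qid not in status:
            q = by_id[qid]
            if all(visit(p) == "yes" for p in q.parents):
                status[qid] = "yes" if answer_fn(q) else "no"
            else:
                status[qid] = "skipped"
        return status[qid]

    for q in questions:
        visit(q.qid)
    return status


def plausibility_score(status: Dict[str, str]) -> float:
    """Fraction of questions answered "yes"; skipped questions count as failures."""
    return sum(s == "yes" for s in status.values()) / len(status)


# Hypothetical question graph for the prompt "a ball is dropped onto a floor".
QUESTIONS = [
    Question("q1", "Is there a ball in the video?", "object"),
    Question("q2", "Does the ball fall toward the floor?", "action", ["q1"]),
    Question("q3", "Does the ball speed up as it falls?", "physics", ["q2"]),
    Question("q4", "Does the ball rebound lower than its drop height?", "physics", ["q2"]),
]

if __name__ == "__main__":
    # Stub answers in place of a real VLM call.
    stub = {"q1": True, "q2": True, "q3": True, "q4": False}
    status = evaluate_graph(QUESTIONS, lambda q: stub[q.qid])
    for q in QUESTIONS:
        print(f"[{q.level}] {q.text} -> {status[q.qid]}")
    print("score:", plausibility_score(status))
```

With the stub answers shown, only the rebound question fails, so the violation is localized to one physics-level question. If the action-level question `q2` failed instead, both physics questions would be marked as skipped rather than asked out of context.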
In practice
- Use PQSG for fine-grained T2V model evaluation.
- Benchmark VLMs on question answering tasks.
- Compare T2V models on physical realism.
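Two of the uses above, comparing models and checking agreement with human judgments, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function names, the model labels `model_a` and `model_b`, and every number are placeholders invented for the example, not results from the paper, and Spearman rank correlation is only one plausible choice of agreement measure.

```python
from statistics import mean
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_models(scores: Dict[str, List[float]]) -> List[str]:
    """Order models from highest to lowest mean per-video score."""
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)


if __name__ == "__main__":
    # Placeholder per-video scores from an automatic metric, keyed by model.
    metric_scores = {
        "model_a": [0.50, 0.75, 0.25],
        "model_b": [1.00, 0.75, 0.50],
    }
    print("ranking:", rank_models(metric_scores))

    # Placeholder human ratings for the same six videos, in the same order.
    metric = metric_scores["model_a"] + metric_scores["model_b"]
    human = [3, 4, 1, 5, 4, 2]
    print("spearman:", spearman(metric, human))
```

Rank correlation is used here because it compares orderings rather than raw values, so the metric and the human ratings do not need to share a scale.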
Topics
- Text-to-Video Generation
- Physical Plausibility
- Video Evaluation
- Vision-Language Models
- Scene Graph
- FinePhyEval Dataset
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.