Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Computer Vision & Image Processing · Depth: Expert, medium

Summary

Physics Question Scene Graph (PQSG) is a hierarchical, question-based pipeline for evaluating the physical plausibility of videos produced by text-to-video models. It targets two gaps: current models often fail to obey basic physical laws, and existing evaluation techniques lack granularity. PQSG uses a vision-language model (VLM), guided by in-context examples, to generate a graph-structured hierarchy of questions that check a video's faithfulness at three levels: objects, actions, and adherence to physical laws. The graph structure enforces logical dependencies between questions and keeps each question contextually valid, so the output is a fine-grained account of which specific physical plausibility violations occur. The system was validated on FinePhyEval, a new dataset of physics-based prompts, videos generated by models such as Sora 2, Veo 3, and Wan 2.1, and human annotations. PQSG correlated more strongly with human judgments than previous methods and ranked the closed-source models above the open-source Wan 2.1 in physical realism.
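
Agreement between an automatic metric and human judgments is commonly measured with a rank correlation. The sketch below computes Spearman's rho in plain Python, purely as an illustration: this summary does not say which correlation coefficient the paper reports, and the scores and ratings are made-up placeholder values, not results from FinePhyEval.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if an input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data: one automatic score and one human rating per video.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 2, 4, 1, 3]
rho = spearman(metric_scores, human_ratings)
```

A value of rho near 1 means the metric orders videos the same way human raters do; comparing rho across metrics on the same annotated videos is one way to decide which metric tracks human judgment more closely.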

Key takeaway

For machine learning engineers developing or evaluating text-to-video models, PQSG offers a fine-grained way to assess physical plausibility rather than relying on subjective impressions. Consider integrating question-based, graph-structured evaluation to pinpoint specific physical law violations and guide targeted model improvements. The same approach gives you a more objective basis for comparing models such as Sora 2 against others on how well their videos adhere to real-world physics.

Key insights

PQSG offers a fine-grained, graph-based question system to evaluate physical plausibility in text-to-video generation.

Principles

Method

PQSG uses a VLM, guided by in-context examples, to generate a hierarchical graph of questions that check whether a video depicts the prompted objects and actions and whether it adheres to physical laws.
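
A minimal sketch of how such a dependency-aware question graph could be evaluated is shown below. It is not the authors' implementation: the `Question` class, the `evaluate` and `violations` helpers, the example prompt and questions, and the rule that a question is skipped when a parent question fails are all assumptions made for illustration. A canned lookup table stands in for the VLM that would answer each question about the video.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Question:
    qid: str
    text: str
    level: str  # "object", "action", or "physics" (hypothetical labels)
    parents: List[str] = field(default_factory=list)


def evaluate(
    questions: List[Question],
    answer_fn: Callable[[Question], bool],
) -> Dict[str, Optional[bool]]:
    """Answer questions in dependency order (the graph is assumed acyclic).

    A question is only put to answer_fn when every parent was answered True;
    otherwise it is recorded as None (skipped, not contextually valid).
    """
    by_id = {q.qid: q for q in questions}
    results: Dict[str, Optional[bool]] = {}

    def visit(qid: str) -> Optional[bool]:
        if qid not in results:
            q = by_id[qid]
            parents_ok = all(visit(p) is True for p in q.parents)
            results[qid] = answer_fn(q) if parents_ok else None
        return results[qid]

    for q in questions:
        visit(q.qid)
    return results


def violations(
    questions: List[Question],
    results: Dict[str, Optional[bool]],
    level: str = "physics",
) -> List[str]:
    """List the questions at the given level that were answered False."""
    return [q.text for q in questions if q.level == level and results[q.qid] is False]


# Hypothetical graph for the prompt "A ball is dropped onto a table and bounces".
QUESTIONS = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Is there a table?", "object"),
    Question("q3", "Does the ball fall onto the table?", "action", ["q1", "q2"]),
    Question("q4", "Does the ball speed up while falling?", "physics", ["q3"]),
    Question("q5", "Does each bounce peak lower than the previous one?", "physics", ["q3"]),
]

# Stand-in for a VLM: canned yes/no answers keyed by question id.
canned = {"q1": True, "q2": True, "q3": True, "q4": True, "q5": False}
results = evaluate(QUESTIONS, lambda q: canned[q.qid])
print(violations(QUESTIONS, results))
```

Recording skipped questions as `None` rather than `False` keeps a missing object from being reported as a physics violation, which is the kind of contextual validity a dependency graph is meant to protect.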

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.