Quantitative Video World Model Evaluation for Geometric-Consistency

2026-05-14 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Researchers have developed PDI-Bench (Perspective Distortion Index), a quantitative framework for evaluating the geometric coherence and physical plausibility of videos produced by generative models. Existing evaluation methods often rely on subjective human judgment or learned graders, which are poorly suited to diagnosing specific geometric failures. PDI-Bench processes a generated clip by first segmenting objects and tracking points with tools such as SAM 2, MegaSaM, and CoTracker3. These observations are then lifted to 3D world-space coordinates via monocular reconstruction, which makes it possible to compute projective-geometry residuals. The residuals quantify failures along three dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support comprehensive testing, the team also created PDI-Dataset, a collection of diverse scenarios crafted to stress these geometric constraints. PDI-Bench has identified consistent, geometry-specific failure modes in state-of-the-art video generators that perceptual metrics often miss.

Key takeaway

For AI Scientists and Computer Vision Engineers developing or evaluating generative video models, PDI-Bench offers a diagnostic tool that goes beyond perceptual metrics. Model evaluations should include quantitative geometric-coherence checks so that specific physical-plausibility failures can be identified. The framework helps pinpoint weaknesses in 3D structure and motion, guiding progress toward more physically grounded video generation and more robust world models. Consider integrating PDI-Bench into your model development pipeline to catch geometric failures that perceptual scores alone would not reveal.

Key insights

PDI-Bench quantitatively audits generative video models for geometric coherence and physical plausibility using 3D reconstruction.

Principles

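Geometric coherence requires scale-depth alignment: an object's apparent size should shrink as its depth increases.
3D motion consistency is crucial for physical plausibility: trajectories should remain coherent once lifted to world space.
Structural rigidity indicates realistic object behavior: points on a rigid object should keep their relative 3D distances.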
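
As an illustration of the first principle, here is a minimal sketch of a scale-depth check. It assumes a pinhole camera, under which a rigid object's on-screen height h and its depth Z satisfy h * Z = constant. The function name, the inputs, and the choice of a mean absolute log-deviation are illustrative assumptions, not the residual defined by PDI-Bench.

```python
import math
from statistics import mean


def scale_depth_residual(pixel_heights, depths):
    """Hypothetical scale-depth residual (not the PDI-Bench formula).

    Under a pinhole model, h_t * Z_t is constant for a rigid object, so the
    spread of log(h_t * Z_t) across frames measures scale-depth mismatch.
    Returns the mean absolute deviation of those logs from their mean.
    """
    if not pixel_heights or len(pixel_heights) != len(depths):
        raise ValueError("need equally sized, non-empty sequences")
    if any(h <= 0 for h in pixel_heights) or any(z <= 0 for z in depths):
        raise ValueError("heights and depths must be positive")
    logs = [math.log(h * z) for h, z in zip(pixel_heights, depths)]
    center = mean(logs)
    return mean([abs(v - center) for v in logs])


# Object recedes and shrinks in proportion: consistent with perspective.
consistent = scale_depth_residual([100, 50, 25], [2, 4, 8])
# Object recedes but keeps the same on-screen size: inconsistent.
inconsistent = scale_depth_residual([100, 100, 100], [2, 4, 8])
```

A value near zero means apparent size and estimated depth agree; larger values indicate that the object grows or shrinks in a way its depth does not explain.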

Method

PDI-Bench segments objects, tracks points, lifts them to 3D via monocular reconstruction, then computes projective-geometry residuals for scale-depth, motion, and rigidity.
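
Once tracked points have been lifted to 3D, a rigidity check can be sketched as follows. This is a simplified stand-in under stated assumptions, not the benchmark's own residual: the input layout (`tracks[t][i]` as the world-space position of point i at frame t) and the function name are hypothetical, and the score is the mean relative change in pairwise point distances against the first frame.

```python
import math
from itertools import combinations


def rigidity_residual(tracks):
    """Hypothetical rigidity residual (not the PDI-Bench formula).

    tracks[t][i] is the (x, y, z) position of tracked point i at frame t.
    For a rigid object, every pairwise distance stays fixed over time, so we
    average the relative distance change with respect to frame 0.
    """
    if len(tracks) < 2:
        raise ValueError("need at least two frames")
    ref = tracks[0]
    pairs = list(combinations(range(len(ref)), 2))
    ref_dists = [math.dist(ref[i], ref[j]) for i, j in pairs]
    total, count = 0.0, 0
    for frame in tracks[1:]:
        for (i, j), d0 in zip(pairs, ref_dists):
            if d0 == 0:
                continue  # coincident reference points carry no scale
            total += abs(math.dist(frame[i], frame[j]) - d0) / d0
            count += 1
    return total / count if count else 0.0


triangle = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
translated = [(1, 2, 3), (2, 2, 3), (1, 3, 3)]  # rigid translation
stretched = [(0, 0, 0), (2, 0, 0), (0, 1, 0)]   # one point drifts away

rigid_score = rigidity_residual([triangle, translated])
deformed_score = rigidity_residual([triangle, stretched])
```

Because only pairwise distances are compared, the score is unaffected by rigid translation or rotation of the object and responds only to deformation.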

In practice

Use SAM 2 for object segmentation.
Employ CoTracker3 for point tracking.
Evaluate 3D motion consistency.
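
One simple way to probe 3D motion consistency is to measure how far a world-space trajectory departs from locally constant velocity. The sketch below is an assumption made for illustration, not the benchmark's motion residual: the function name and the normalisation by mean step length are hypothetical, and a real evaluation would need to allow for legitimately accelerating objects.

```python
import math


def motion_residual(positions):
    """Hypothetical motion-consistency residual (not the PDI-Bench formula).

    positions is a list of (x, y, z) world-space points, one per frame.
    Returns the mean second-difference magnitude divided by the mean
    per-frame step length, so 0 means constant velocity.
    """
    if len(positions) < 3:
        raise ValueError("need at least three frames")
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    mean_step = sum(steps) / len(steps)
    if mean_step == 0:
        return 0.0  # a stationary point is trivially consistent
    accels = [
        math.sqrt(sum((a - 2 * b + c) ** 2 for a, b, c in zip(p0, p1, p2)))
        for p0, p1, p2 in zip(positions, positions[1:], positions[2:])
    ]
    return (sum(accels) / len(accels)) / mean_step


smooth = motion_residual([(0, 0, 5), (1, 0, 5), (2, 0, 5), (3, 0, 5)])
jumpy = motion_residual([(0, 0, 5), (1, 0, 5), (5, 0, 5), (6, 0, 5)])
```

The second trajectory contains a sudden jump between frames, which is the kind of discontinuity that can look acceptable in 2D yet stands out once positions are expressed in world space.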

Topics

PDI-Bench
Geometric Consistency
Video World Models
Generative Video Models
3D Reconstruction

Best for: AI Scientist, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.