PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

2026-05-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

PRISM is a large-scale benchmark for evaluating how well language models generate programmatic video, with a focus on spatially correct animated output. It comprises 10,372 human-calibrated instruction-code pairs, making it 20 times larger than previous benchmarks, and covers 437 subject categories across English and Chinese real-world knowledge visualization scenarios. Its funnel-style evaluation framework uses four metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for dynamic expression. An initial evaluation of seven mainstream LLMs revealed a significant "Execution-Spatial Gap": on average, a 41% drop from execution success rate to spatial pass rate. Executable code often lacks spatial coherence, so evaluation has to go beyond executability.

Key takeaway

NLP engineers and AI scientists who develop or evaluate language models for programmatic video generation should prioritize spatial reasoning metrics, not just code executability. The "Execution-Spatial Gap," an average 41% drop from execution success rate to spatial pass rate across the seven evaluated LLMs, shows that executability alone is an insufficient evaluation criterion. Integrating benchmarks such as PRISM and its funnel-style metrics into development and testing workflows helps verify that models produce coherent, usable animated output.

Key insights

Language models exhibit a significant "Execution-Spatial Gap" in programmatic video generation: code that executes successfully often still produces spatially incoherent output.

Principles

Programmatic video generation demands geometric precision and temporal coherence.
Evaluation must extend beyond code executability to spatial correctness.

Method

A funnel-style evaluation framework assesses Code-Level Reliability, Spatial Reasoning, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD).
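
The funnel idea can be illustrated with a minimal Python sketch. It is not PRISM's implementation: the SampleResult record, the funnel_rates function, and the choice to score both rates over all samples are assumptions made for illustration. The summary does not say whether the 41% drop is measured in percentage points or relative to the execution rate, so the sketch reports both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleResult:
    """Outcome of one generated program (hypothetical record layout)."""
    executed: bool              # stage 1: did the code run and render?
    spatial_ok: Optional[bool]  # stage 2: layout check; None if nothing rendered


def funnel_rates(results):
    """Score stage 1 and stage 2 over all samples and report the gap.

    Assumption: both rates use the full sample count as denominator.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    executed = [r for r in results if r.executed]
    # Only samples that survived stage 1 are eligible for stage 2.
    spatial = [r for r in executed if r.spatial_ok]
    exec_rate = len(executed) / total
    spatial_rate = len(spatial) / total
    gap = exec_rate - spatial_rate
    return {
        "exec_rate": exec_rate,
        "spatial_rate": spatial_rate,
        "gap_points": gap,  # absolute difference between the two rates
        "gap_relative": gap / exec_rate if exec_rate else 0.0,
    }


if __name__ == "__main__":
    # Made-up outcomes: 8 of 10 programs run, 4 of those pass the layout check.
    demo = (
        [SampleResult(True, True)] * 4
        + [SampleResult(True, False)] * 4
        + [SampleResult(False, None)] * 2
    )
    print(funnel_rates(demo))
```

The later funnel stages (PADVC and TD) would apply only to samples that pass the spatial stage; they are left out because the summary does not define how they are computed.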

In practice

Use PRISM to benchmark LLMs for spatially coherent video generation.
Incorporate spatial reasoning metrics into programmatic code evaluation.
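
As one possible starting point for a spatial check, the sketch below flags elements that leave the frame or overlap each other, given bounding boxes extracted from a rendered frame. This is an illustrative stand-in, not PRISM's Spatial Reasoning metric, whose definition is not given in this summary. Box, overlaps, and layout_issues are hypothetical names, and how the boxes are obtained from a renderer is outside the sketch.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one rendered element (hypothetical)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a, b):
    """True if the box interiors intersect; touching edges do not count."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def layout_issues(boxes, frame_w, frame_h):
    """List out-of-frame elements first, then pairwise overlaps."""
    issues = []
    for b in boxes:
        if b.x0 < 0 or b.y0 < 0 or b.x1 > frame_w or b.y1 > frame_h:
            issues.append(f"{b.name} out of frame")
    for a, b in combinations(boxes, 2):
        if overlaps(a, b):
            issues.append(f"{a.name} overlaps {b.name}")
    return issues


if __name__ == "__main__":
    # Made-up layout for a 120 x 90 frame.
    frame = [
        Box("title", 0, 0, 100, 20),
        Box("chart", 10, 10, 90, 80),
        Box("label", 95, 50, 130, 60),
    ]
    for issue in layout_issues(frame, frame_w=120, frame_h=90):
        print(issue)
```

A sample would count as a spatial pass here only when layout_issues returns an empty list. A real check would probably need to allow intentional overlaps, such as a label placed inside a chart.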

Topics

PRISM Benchmark
Programmatic Video Generation
Spatial-Temporal Reasoning
Language Models
Benchmark Evaluation
Code Generation

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.