PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
Summary
PRISM is a large-scale benchmark for evaluating how well language models generate programmatic video, with a focus on spatially correct animated output. It comprises 10,372 human-calibrated instruction-code pairs, making it 20 times larger than previous benchmarks, and covers 437 subject categories across English and Chinese real-world knowledge visualization scenarios. The benchmark uses a funnel-style evaluation framework with four metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for dynamic expression. An initial evaluation of seven mainstream LLMs revealed a significant "Execution-Spatial Gap": an average 41% drop from execution success rate to spatial pass rate, which indicates that code that runs often still produces spatially incoherent output. Executability alone is therefore an insufficient evaluation signal.
Key takeaway
NLP engineers and AI scientists who develop or evaluate language models for programmatic video generation should measure spatial reasoning in addition to code executability. The "Execution-Spatial Gap," an average 41% drop from execution success rate to spatial pass rate, indicates that executability-only evaluation is insufficient. Integrating benchmarks like PRISM and its funnel-style metrics into development and testing workflows helps verify that models produce coherent, usable animated output.
Key insights
Language models exhibit a significant "Execution-Spatial Gap" in programmatic video generation: their code often executes but produces spatially incoherent output.
Principles
- Programmatic video generation demands geometric precision and temporal coherence.
- Evaluation must extend beyond code executability to spatial correctness.
Method
A funnel-style evaluation framework assesses Code-Level Reliability, Spatial Reasoning, Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD).
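The first two stages of such a funnel can be sketched in a few lines of Python. This is a minimal illustration, not PRISM's implementation: the `SampleResult` record and `funnel_rates` function are hypothetical names, and the gap is computed here as the absolute difference between the two rates, which is an assumption since this summary does not say whether the reported 41% drop is absolute or relative.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Outcome of one generated program (hypothetical record, not PRISM's schema)."""
    executed: bool      # did the generated code run and render?
    spatial_pass: bool  # did the rendered layout pass the spatial check?


def funnel_rates(results):
    """Return (execution_rate, spatial_pass_rate, gap) over all samples.

    A sample counts as a spatial pass only if it also executed, so the
    second stage of the funnel can never exceed the first.
    The gap is the absolute difference between the two rates (assumption).
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    executed = sum(r.executed for r in results)
    spatial = sum(r.executed and r.spatial_pass for r in results)
    exec_rate = executed / total
    spatial_rate = spatial / total
    return exec_rate, spatial_rate, exec_rate - spatial_rate


if __name__ == "__main__":
    # Toy data: 10 samples, 8 execute, 4 of those also pass the spatial check.
    toy = (
        [SampleResult(True, True)] * 4
        + [SampleResult(True, False)] * 4
        + [SampleResult(False, False)] * 2
    )
    print(funnel_rates(toy))
```

Scoring the spatial stage over all samples, rather than only over those that executed, keeps both rates on the same denominator so that their difference is directly interpretable.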
In practice
- Use PRISM to benchmark LLMs for spatially coherent video generation.
- Incorporate spatial reasoning metrics into programmatic code evaluation.
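As one illustration of what a spatial check added to an evaluation pipeline might look like, the sketch below tests whether rendered elements stay inside the frame and do not overlap. This is a stand-in under stated assumptions, not PRISM's Spatial Reasoning metric, whose definition is not given in this summary: the `Box` type, the axis-aligned bounding-box representation, and the `layout_passes` rule are all hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of one rendered element (hypothetical)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def in_frame(box, frame_w, frame_h):
    """True if the box lies entirely inside a frame of the given size."""
    return (
        box.x >= 0
        and box.y >= 0
        and box.x + box.w <= frame_w
        and box.y + box.h <= frame_h
    )


def overlaps(a, b):
    """True if two boxes share interior area (touching edges do not count)."""
    return (
        a.x < b.x + b.w
        and b.x < a.x + a.w
        and a.y < b.y + b.h
        and b.y < a.y + a.h
    )


def layout_passes(boxes, frame_w, frame_h):
    """Assumed pass rule: every box is in frame and no pair overlaps."""
    all_in_frame = all(in_frame(b, frame_w, frame_h) for b in boxes)
    any_overlap = any(overlaps(a, b) for a, b in combinations(boxes, 2))
    return all_in_frame and not any_overlap
```

A program can execute without error and still fail a check like this, for example when two text labels are drawn on top of each other, which is the kind of failure an executability-only metric cannot see.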
Topics
- PRISM Benchmark
- Programmatic Video Generation
- Spatial-Temporal Reasoning
- Language Models
- Benchmark Evaluation
- Code Generation
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.