OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Summary
OSCBench is a new benchmark for evaluating Object State Change (OSC) in Text-to-Video (T2V) generation models, a critical aspect of action understanding that existing benchmarks often overlook. Built from instructional cooking data, OSCBench organizes action-object interactions into regular, novel, and compositional scenarios to assess both in-distribution performance and generalization. The benchmark comprises 1,120 prompts across 140 object-state scenarios. The authors evaluated six representative T2V models (Open-Sora-2.0, HunyuanVideo, HunyuanVideo-1.5, Wan-2.2, Kling-2.5-Turbo, and Veo-3.1-Fast) using both human user studies and automatic evaluation based on Multimodal Large Language Models (MLLMs). The results indicate that current T2V models struggle to produce accurate and temporally consistent object state changes, particularly in novel and compositional settings, even when their semantic and scene alignment is strong.
Key takeaway
Research scientists developing or evaluating Text-to-Video models should prioritize improving object state change (OSC) capabilities. Current models struggle with OSC, particularly in novel or compositional action sequences, which points to a significant gap in language-grounded reasoning. Integrating OSCBench into an evaluation pipeline can help diagnose specific weaknesses and guide model improvements toward more faithful rendering of an action's consequences.
Key insights
Current T2V models struggle with accurately generating object state changes, especially in novel and compositional scenarios.
Principles
- Object state change is crucial for language-grounded action understanding.
- Benchmarks should test both in-distribution performance and generalization.
Method
OSCBench constructs scenarios by abstracting actions and objects from instructional cooking data, categorizing them into regular, novel, and compositional types, and generating structured prompts for T2V model evaluation.
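To make the construction step concrete, here is a minimal sketch of how a scenario record and its prompts might be represented. Everything in it is an assumption made for illustration: the `Scenario` fields, the `TEMPLATES` wording, and the `build_prompts` helper are hypothetical, and the benchmark's actual prompt text and categorization rules are not given in this summary. The only figures taken from the source are the totals, which imply an average of eight prompts per scenario (1,120 / 140).

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ScenarioType(Enum):
    """The three scenario categories named in the summary."""
    REGULAR = "regular"
    NOVEL = "novel"
    COMPOSITIONAL = "compositional"


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one action-object scenario."""
    actions: tuple[str, ...]  # one or more verbs, e.g. ("peel",) or ("peel", "slice")
    obj: str                  # the object acted on, e.g. "an apple"
    kind: ScenarioType


# Hypothetical templates; the benchmark's real prompt wording is not known here.
TEMPLATES = (
    "Hands {action} {obj} on a kitchen counter.",
    "Close-up: two hands {action} {obj}, showing the result.",
)


def build_prompts(scenario: Scenario) -> list[str]:
    """Fill every template with the scenario's action(s) and object."""
    action = " and then ".join(scenario.actions)
    return [t.format(action=action, obj=scenario.obj) for t in TEMPLATES]


# Average prompts per scenario implied by the reported totals.
PROMPTS_PER_SCENARIO = 1120 // 140
```

For example, `build_prompts(Scenario(("peel", "slice"), "an apple", ScenarioType.COMPOSITIONAL))` yields one prompt per template, each describing the two actions in sequence.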
In practice
- Focus T2V model development on improving object state change accuracy.
- Use MLLMs with Chain-of-Thought for structured video evaluation.
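One way to apply the second recommendation is to give the MLLM a fixed sequence of reasoning steps and require a machine-readable verdict at the end. The sketch below assumes a hypothetical four-step rubric and a 1 to 5 score; the steps, the scale, and the `build_eval_prompt` and `parse_score` helpers are illustrative assumptions, not the evaluation protocol used by OSCBench. No model API is called, so the snippet only builds the instruction text and parses a reply.

```python
from __future__ import annotations

import re

# Hypothetical reasoning steps; the benchmark's actual rubric is not given here.
COT_STEPS = (
    "Describe the object's state at the start of the video.",
    "Describe the action that is performed on the object.",
    "Describe the object's state at the end of the video.",
    "Decide whether the final state matches what the prompt implies.",
)


def build_eval_prompt(t2v_prompt: str) -> str:
    """Build a step-by-step evaluation instruction for an MLLM judge."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(COT_STEPS, start=1))
    return (
        f'The video was generated from this prompt: "{t2v_prompt}"\n'
        "Reason through the following steps in order:\n"
        f"{steps}\n"
        "Finish with a line of the form 'SCORE: <integer from 1 to 5>'."
    )


def parse_score(response: str) -> int | None:
    """Extract the 1-5 score from the judge's reply, or None if absent."""
    match = re.search(r"SCORE:\s*([1-5])\b", response)
    return int(match.group(1)) if match else None
```

Asking for the score on a dedicated final line keeps the free-form reasoning separate from the value that gets aggregated, and replies without a valid score can be flagged for re-querying instead of being silently counted.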
Topics
- Text-to-Video Generation
- Object State Change
- OSCBench Benchmark
- Multimodal LLM Evaluation
- Chain-of-Thought Reasoning
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.