OSCBench: Benchmarking Object State Change in Text-to-Video Generation

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, long

Summary

OSCBench is a benchmark for evaluating how well Text-to-Video (T2V) generation models render Object State Change (OSC), an aspect of action understanding that existing benchmarks often overlook. Built from instructional cooking data, OSCBench organizes action-object interactions into regular, novel, and compositional scenarios so that it can assess both in-distribution performance and generalization. The benchmark comprises 1,120 prompts across 140 object-state scenarios. The researchers evaluated six representative T2V models (Open-Sora-2.0, HunyuanVideo, HunyuanVideo-1.5, Wan-2.2, Kling-2.5-Turbo, and Veo-3.1-Fast) using both human user studies and automatic evaluation based on a Multimodal Large Language Model (MLLM). The results indicate that current T2V models achieve strong semantic and scene alignment but struggle to produce accurate, temporally consistent object state changes, particularly in novel and compositional settings.

Key takeaway

If you develop or evaluate Text-to-Video models, prioritize object state change (OSC). Current models struggle with OSC, particularly in novel or compositional action sequences, which points to a significant gap in language-grounded reasoning. Integrate OSCBench into your evaluation pipeline to diagnose specific weaknesses and to guide improvements toward videos that faithfully show the consequences of an action.

Key insights

Current T2V models struggle with accurately generating object state changes, especially in novel and compositional scenarios.

Principles

Object state change is crucial for language-grounded action understanding.
Benchmarks should test both in-distribution performance and generalization.
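
One way to make the second principle concrete is to report scores per scenario category and compare the in-distribution split against the others. The sketch below assumes each evaluated video yields a single numeric score and treats "regular" as the in-distribution split; the function names, the score scale, and the sample values are all hypothetical placeholders, not figures from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(records):
    """records: iterable of (category, score) pairs -> {category: mean score}."""
    buckets = defaultdict(list)
    for category, score in records:
        buckets[category].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}


def generalization_gap(means, in_dist="regular"):
    """In-distribution mean minus the mean of every other category."""
    base = means[in_dist]
    return {c: base - m for c, m in means.items() if c != in_dist}


# Placeholder scores invented for illustration only; NOT results from the paper.
records = [
    ("regular", 4), ("regular", 3),
    ("novel", 2), ("novel", 3),
    ("compositional", 1), ("compositional", 2),
]
means = mean_by_category(records)
gaps = generalization_gap(means)
```

A positive gap for a category means the model scored lower there than on the regular split, which is the pattern the summary describes for novel and compositional scenarios.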

Method

OSCBench constructs scenarios by abstracting actions and objects from instructional cooking data, categorizing them into regular, novel, and compositional types, and generating structured prompts for T2V model evaluation.
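
The construction described above can be sketched as a small data structure plus a prompt template. This is a minimal illustration, not the benchmark's code: the `Scenario` class, the prompt wording, and the example action-object pairs are hypothetical, and the category labels are assigned by hand here because the summary does not say how the paper decides them.

```python
from dataclasses import dataclass

CATEGORIES = ("regular", "novel", "compositional")


@dataclass(frozen=True)
class Scenario:
    actions: tuple  # one or more action verbs, in order
    obj: str        # the object being acted on
    category: str   # one of CATEGORIES

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not self.actions:
            raise ValueError("a scenario needs at least one action")


def build_prompt(s: Scenario) -> str:
    """Render a scenario with a hypothetical template (not the benchmark's wording)."""
    first, *rest = s.actions
    steps = [f"{first} the {s.obj}"] + [f"{a} it" for a in rest]
    return "Cooking video. Instruction: " + ", then ".join(steps) + "."


# Invented example scenarios, one per category label.
scenarios = [
    Scenario(("slice",), "tomato", "regular"),
    Scenario(("peel",), "mango", "novel"),
    Scenario(("peel", "slice"), "apple", "compositional"),
]
prompts = [build_prompt(s) for s in scenarios]
```

The totals reported in the summary work out to 1,120 / 140 = 8 prompts per scenario on average; whether every scenario has exactly 8 is not stated there.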

In practice

Focus T2V model development on improving object state change accuracy.
Use MLLMs with Chain-of-Thought for structured video evaluation.
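
As a rough sketch of the second point, a Chain-of-Thought evaluation can be framed as a rubric that asks the MLLM to reason through the object's state before scoring, plus a parser for the final score. Everything here is an assumption for illustration: the rubric wording, the 1-5 scale, the `SCORE:` output format, and the function names are not taken from the paper, and the call to an actual MLLM is deliberately left out.

```python
import re


def build_eval_prompt(video_prompt: str, obj: str, expected_state: str) -> str:
    """Assemble a step-by-step rubric; the wording and 1-5 scale are assumptions."""
    steps = [
        f"1. Describe the initial state of the {obj}.",
        "2. Describe the action performed and when it happens.",
        f"3. Describe the final state of the {obj}.",
        f"4. Decide whether the {obj} ends up {expected_state} and stays that way.",
    ]
    return (
        f"The video was generated from this prompt: {video_prompt!r}\n"
        "Reason through the following steps before scoring:\n"
        + "\n".join(steps)
        + "\nFinish with a line of the form 'SCORE: <1-5>'."
    )


# Matches a line consisting only of 'SCORE: n' with n between 1 and 5.
_SCORE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_score(response: str):
    """Return the last well-formed score in the response, or None if absent."""
    matches = _SCORE.findall(response)
    return int(matches[-1]) if matches else None
```

Asking for the reasoning first and the score last keeps the judgment tied to the observed state change, and returning `None` for malformed responses lets a pipeline retry or flag them instead of silently recording a wrong score.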

Topics

Text-to-Video Generation
Object State Change
OSCBench Benchmark
Multimodal LLM Evaluation
Chain-of-Thought Reasoning

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.