NEST: Narrative Event Structures in Time for Long Video Understanding
Summary
NEST (Narrative Event Structures in Time) is a new dataset and benchmark for evaluating narrative understanding in long videos, specifically full-length movies. It comprises 1,005 movies, averaging approximately 98 minutes each, annotated with around 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST introduces structured annotations that link events through temporal ordering, hierarchical composition, and long-range dependencies. The benchmark evaluates four tasks: Event Trigger Detection (ETD), Event Localization (EL), Event Argument Extraction (EAE), and Event Relation Extraction (ERE). Initial baselines show that current models struggle with grounded event discovery, scoring below 8% on ETD, under 6% on EL, and below 11% on EAE. ERE is more tractable, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.
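To make the ERE F1 figures concrete, here is a minimal sketch of how a micro-averaged F1 over predicted relation triples could be computed. The exact matching criteria used by the benchmark are not given in this summary, so exact-match scoring, the function name `micro_f1`, the event ids, and the relation labels are all assumptions made for illustration.

```python
def micro_f1(predicted, gold):
    """F1 over exact-match (head_event, relation, tail_event) triples.

    Assumption: a prediction counts as correct only if all three
    fields match a gold triple exactly.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical event ids and relation labels, for illustration only.
gold = {
    ("e1", "before", "e2"),
    ("e2", "subevent_of", "e3"),
    ("e1", "depends_on", "e3"),
}
predicted = {
    ("e1", "before", "e2"),
    ("e2", "before", "e3"),
}

score = micro_f1(predicted, gold)
print(f"ERE F1: {score:.2f}")
```

In this toy case one of two predictions is correct (precision 0.5) and one of three gold relations is recovered (recall 1/3).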
Key takeaway
For AI Scientists and Machine Learning Engineers developing long-form video understanding systems, NEST highlights a critical gap in narrative comprehension. You should prioritize research into models that reason over complex event relationships and long temporal distances, rather than models that merely process longer token streams. The low baseline scores on ETD, EL, and EAE indicate significant room for innovation in structured event extraction and multimodal grounding, beyond simple retrieval tasks.
Key insights
Current vision-language models struggle with deep narrative structure and long-range dependencies in extended video content.
Principles
- Narrative understanding requires reasoning beyond flat token streams.
- Audio descriptions provide high-quality, human-created visual narratives.
- PropBank conventions ensure consistent event semantics and argument roles.
Method
NEST employs an LLM-assisted pipeline for event trigger detection, argument extraction, relation extraction, and video localization, grounded in audio descriptions and PropBank-selected verbs.
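As a rough illustration of what the pipeline's output might look like, the sketch below models one annotated event and one relation between events. This is not the dataset's actual schema: the class names, field names, relation label, and example values are hypothetical, and only the general ingredients (a trigger verb, PropBank-style argument roles, a localized time span, and inter-event relations) come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NarrativeEvent:
    """Hypothetical record for a single grounded event."""
    event_id: str
    trigger: str               # verb that evokes the event
    roleset: str               # PropBank-style roleset id
    arguments: dict[str, str]  # role label -> text span
    start_sec: float           # localized start time in the movie
    end_sec: float             # localized end time in the movie


@dataclass
class EventRelation:
    """Hypothetical directed relation between two events."""
    head_id: str
    tail_id: str
    relation: str              # label set here is an assumption


def temporal_gap(earlier: NarrativeEvent, later: NarrativeEvent) -> float:
    """Seconds between the end of one event and the start of another."""
    return later.start_sec - earlier.end_sec


# Invented example values, for illustration only.
e1 = NarrativeEvent("e1", "opens", "open.01",
                    {"ARG0": "the detective", "ARG1": "the safe"},
                    start_sec=120.0, end_sec=124.0)
e2 = NarrativeEvent("e2", "discovers", "discover.01",
                    {"ARG0": "the detective", "ARG1": "the forged letter"},
                    start_sec=5400.0, end_sec=5410.0)
link = EventRelation(head_id="e1", tail_id="e2", relation="before")

print(f"{link.head_id} -> {link.tail_id}: {temporal_gap(e1, e2):.0f} s apart")
```

Storing relations separately from events keeps long-range links (here, two events nearly 90 minutes apart) explicit rather than implied by position in a token stream.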
In practice
- Utilize audio descriptions for robust event grounding in video.
- Adopt PropBank for consistent event ontology and argument roles.
- Sparsely sample video at 0.1 FPS for long-context model training.
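For a sense of scale, sampling at 0.1 FPS means keeping one frame every 10 seconds. The sketch below computes the sampled timestamps and the matching frame indices for an average-length movie. The function names and the native frame rate of 24 FPS are assumptions for illustration; actual frame decoding is left out.

```python
def sample_timestamps(duration_sec: int, interval_sec: int = 10) -> list[int]:
    """Timestamps (in seconds) of the frames to keep.

    An interval of 10 seconds corresponds to 0.1 FPS.
    """
    return list(range(0, duration_sec, interval_sec))


def frame_indices(timestamps: list[int], native_fps: float) -> list[int]:
    """Map each timestamp to the nearest frame index in the source video."""
    return [round(t * native_fps) for t in timestamps]


# A 98-minute movie, matching the average length reported above.
duration_sec = 98 * 60
timestamps = sample_timestamps(duration_sec)
indices = frame_indices(timestamps, native_fps=24.0)  # 24 FPS is an assumption

print(len(timestamps), "frames sampled")
```

At this rate a 98-minute movie yields 588 frames, which is what makes whole-movie input feasible for long-context models.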
Topics
- Long Video Understanding
- Narrative Event Structures
- Multimodal Event Extraction
- Video Datasets
- Vision-Language Models
- PropBank
- Temporal Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.