NEST: Narrative Event Structures in Time for Long Video Understanding

2025-06-21 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

NEST (Narrative Event Structures in Time) is a new dataset and benchmark designed to evaluate narrative understanding in long videos, specifically full-length movies. It comprises 1005 movies, averaging approximately 98 minutes each, annotated with around 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST introduces structured annotations that link events through temporal ordering, hierarchical composition, and long-range dependencies. The benchmark evaluates four tasks: Event Trigger Detection (ETD), Event Localization (EL), Event Argument Extraction (EAE), and Event Relation Extraction (ERE). Initial baselines show that current models struggle significantly with grounded event discovery, scoring below 8% on ETD, under 6% on EL, and below 11% on EAE. ERE is more tractable, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

Key takeaway

For AI Scientists and Machine Learning Engineers developing long-form video understanding systems, NEST highlights a critical gap in narrative comprehension. You should prioritize research into models that reason over complex event relationships and long temporal distances, rather than models that merely process longer token streams. The low baseline scores on ETD, EL, and EAE indicate significant opportunities for innovation in structured event extraction and multimodal grounding, moving beyond simple retrieval tasks.

Key insights

Current vision-language models struggle with deep narrative structure and long-range dependencies in extended video content.

Principles

Narrative understanding requires reasoning beyond flat token streams.
Audio descriptions provide high-quality, human-created visual narratives.
PropBank conventions ensure consistent event semantics and argument roles.

Method

NEST employs an LLM-assisted pipeline for event trigger detection, argument extraction, relation extraction, and video localization, grounded in audio descriptions and PropBank-selected verbs.
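
To make the four annotation layers concrete, the sketch below shows one way such a record could be represented. It is not the NEST schema: the class names, field names, the "enables" relation label, and the example events are all hypothetical, chosen only for illustration. The only borrowed convention is PropBank's numbered roles (for the roleset give.01, ARG0 is the giver, ARG1 the thing given, ARG2 the recipient).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    """Hypothetical event record; field names are assumptions, not NEST's schema."""
    event_id: str
    trigger: str                 # verb that triggers the event
    roleset: str                 # PropBank-style roleset id, e.g. "give.01"
    arguments: Dict[str, str]    # PropBank role label -> text span
    start_s: float               # localized start time in the video (seconds)
    end_s: float                 # localized end time in the video (seconds)
    parent_id: Optional[str] = None  # hierarchical composition: enclosing event


@dataclass
class Relation:
    """Hypothetical typed link between two events."""
    head: str    # event_id of the earlier event
    tail: str    # event_id of the later event
    label: str   # illustrative label, not taken from the benchmark


def order_by_time(events: List[Event]) -> List[Event]:
    """Temporal ordering: sort events by start time, then end time."""
    return sorted(events, key=lambda e: (e.start_s, e.end_s))


def children_of(events: List[Event], parent_id: str) -> List[Event]:
    """Hierarchical composition: sub-events attached to a given parent."""
    return [e for e in events if e.parent_id == parent_id]


def gap_seconds(rel: Relation, by_id: Dict[str, Event]) -> float:
    """Long-range dependency: time between the head's end and the tail's start."""
    return by_id[rel.tail].start_s - by_id[rel.head].end_s


# Invented example: a key handed over early enables a vault opening much later.
heist = Event("e0", "rob", "rob.01", {"ARG0": "the crew"}, 250.0, 5200.0)
give = Event("e1", "give", "give.01",
             {"ARG0": "the stranger", "ARG1": "a key", "ARG2": "the detective"},
             300.0, 312.0, parent_id="e0")
open_ = Event("e2", "open", "open.01",
              {"ARG0": "the detective", "ARG1": "the vault"},
              5100.0, 5125.0, parent_id="e0")
events = [open_, give, heist]
by_id = {e.event_id: e for e in events}
link = Relation("e1", "e2", "enables")
```

In this toy record the two linked events sit 4788 seconds apart, the kind of distance a model cannot bridge by looking at a single clip.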

In practice

Utilize audio descriptions for robust event grounding in video.
Adopt PropBank for consistent event ontology and argument roles.
Sparsely sample video at 0.1 FPS for long-context model training.
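
As a rough illustration of the last point, 0.1 FPS means one frame every 10 seconds, so a 98-minute movie yields 588 frames. The sketch below only computes which timestamps and frame indices to keep. The function names and the 24 FPS native frame rate are assumptions for the example, and actual frame decoding is left to whichever video library you use.

```python
import math
from typing import List


def sample_timestamps(duration_s: float, fps: float = 0.1) -> List[float]:
    """Return evenly spaced timestamps (seconds) at the given sampling rate.

    The first sample is at t = 0 and samples stop before duration_s.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if duration_s <= 0:
        return []
    interval_s = 1.0 / fps                     # 0.1 FPS -> 10 s between frames
    count = math.ceil(duration_s / interval_s)
    return [k * interval_s for k in range(count)]


def to_frame_indices(timestamps: List[float], native_fps: float) -> List[int]:
    """Map timestamps to frame indices in the source video.

    native_fps is the source frame rate (assumed known, e.g. 24.0).
    """
    return [round(t * native_fps) for t in timestamps]


# A 98-minute movie, matching the average length reported for the dataset.
movie_seconds = 98 * 60
stamps = sample_timestamps(movie_seconds, fps=0.1)
indices = to_frame_indices(stamps, native_fps=24.0)  # 24 FPS is an assumption
```

At this rate the frame budget grows by only 6 frames per minute of video, which is what makes feeding a full movie into a long-context model feasible.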

Topics

Long Video Understanding
Narrative Event Structures
Multimodal Event Extraction
Video Datasets
Vision-Language Models
PropBank
Temporal Reasoning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.