SVHighlights: Towards Extremely Long Sport Video Highlight Detection
Summary
SVHighlights is introduced as the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour. It comprises 320 videos across multiple sports, totaling 640.18 hours with an average duration of 2.00 hours, significantly surpassing previous datasets. The benchmark is built with a scalable pipeline that aligns full-length sports broadcasts with their official highlight videos, avoiding costly manual per-clip annotation. To address the challenges of long-form content, the paper also presents TF-SELECTOR, a training-free, segment-based method. TF-SELECTOR divides videos into context-aware segments by merging semantically consistent shots, then uses a large language model (LLM) with multimodal inputs (visual captions, transcripts, and audio volume) to predict segment-level saliency scores. In experiments on SVHighlights, TF-SELECTOR outperforms baselines tuned for video temporal grounding (VTG) by +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU.
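The IoU figure above measures overlap between predicted and reference highlight spans. This summary does not give the paper's exact metric definition, so the following is only a minimal sketch of one common formulation: temporal IoU over two sets of intervals in seconds. The function names and the interval-union approach are assumptions made for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals, in seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Total duration covered by a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def temporal_iou(predicted, reference):
    """Intersection-over-union of two sets of time intervals."""
    pred = merge_intervals(predicted)
    ref = merge_intervals(reference)
    # After merging, each list is disjoint, so pairwise overlaps can be summed.
    intersection = 0.0
    for p_start, p_end in pred:
        for r_start, r_end in ref:
            intersection += max(0.0, min(p_end, r_end) - max(p_start, r_start))
    union = total_length(pred) + total_length(ref) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, a prediction of 0 to 10 seconds against a reference of 5 to 15 seconds overlaps for 5 seconds out of a 15-second union, so `temporal_iou([(0, 10)], [(5, 15)])` evaluates to one third.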
Key takeaway
If you are a machine learning engineer building highlight detection for long-form video, consider a segment-based, multimodal approach. Your current clip-level models likely struggle with hour-long content due to context limitations. Off-the-shelf LLMs given visual captions, transcripts, and audio volume can predict saliency without any training. This strategy, demonstrated by TF-SELECTOR, outperforms VTG-tuned baselines on videos exceeding one hour, which makes it a scalable option for real-world applications.
Key insights
The paper tackles highlight detection in sports videos longer than one hour with a new benchmark (SVHighlights) and a training-free, segment-based, multimodal LLM method (TF-SELECTOR).
Principles
- Official highlights enable scalable labeling.
- Segment-based processing improves context.
- Multimodal inputs enhance saliency prediction.
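The third principle can be illustrated with a sketch of how the three signals named in the summary (visual captions, transcripts, and audio volume) might be folded into a single text prompt for an LLM. The summary does not describe the actual prompt, so the `Segment` dataclass, its field names, the prompt wording, and the 0 to 10 scale are all hypothetical. The LLM call itself is left out; only prompt construction and reply parsing are shown.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One video segment with its multimodal descriptions (hypothetical schema)."""
    start_s: float
    end_s: float
    caption: str           # visual caption, e.g. from a captioning model
    transcript: str        # speech transcript for the segment
    mean_volume_db: float  # average audio volume over the segment


def build_saliency_prompt(segment: Segment) -> str:
    """Combine the three modalities into one text prompt."""
    return (
        "You are rating segments of a sports broadcast.\n"
        f"Time: {segment.start_s:.0f}s to {segment.end_s:.0f}s\n"
        f"Visual caption: {segment.caption}\n"
        f"Commentary transcript: {segment.transcript}\n"
        f"Mean audio volume: {segment.mean_volume_db:.1f} dB\n"
        "On a scale from 0 to 10, how likely is this segment to appear "
        "in the official highlights? Answer with a single number."
    )


def parse_score(reply: str, low: float = 0.0, high: float = 10.0) -> float:
    """Take the first number in the LLM reply and clamp it to [low, high]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return low  # assumption: treat an unparseable reply as not salient
    return min(high, max(low, float(match.group())))
```

Keeping the prompt purely textual is what lets an off-the-shelf LLM be used without training: every modality is reduced to text or a number before scoring.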
Method
TF-SELECTOR segments videos by merging semantically consistent shots, then uses a vision-language model (VLM) to caption each segment and an LLM to predict segment saliency from the captions, transcripts, and audio volume.
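The shot-merging step can be sketched as follows. The summary says only that semantically consistent shots are merged, so the use of per-shot embedding vectors, cosine similarity between neighbouring shots, and the 0.8 threshold are assumptions chosen for illustration, not the paper's stated procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_shots(shots, threshold=0.8):
    """Merge temporally adjacent shots whose embeddings are similar.

    shots: list of (start_s, end_s, embedding) in temporal order.
    Returns a list of (start_s, end_s) segments.
    """
    segments = []
    prev_embedding = None
    for start, end, embedding in shots:
        # Each shot is compared with the shot just before it,
        # not with an average over the whole running segment.
        if segments and cosine(prev_embedding, embedding) >= threshold:
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
        prev_embedding = embedding
    return segments
```

With three shots where the first two have near-identical embeddings and the third differs, the first two are merged into one segment and the third starts a new one.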
In practice
- Use PSNR (peak signal-to-noise ratio) for robust frame alignment between highlight videos and full broadcasts.
- Combine ASR transcripts with visual cues.
- Employ LLMs for zero-shot saliency scoring.
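The first tip can be sketched as a search for the broadcast frame that best matches a given highlight frame under PSNR. Frames are represented here as flat lists of 8-bit grayscale pixel values; that representation, the function names, and the exhaustive search are assumptions for illustration. A practical pipeline would decode frames with a video library and restrict the search window.

```python
import math


def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-size frames."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def best_match(highlight_frame, broadcast_frames):
    """Return (index, psnr) of the broadcast frame closest to the highlight frame."""
    best_index, best_score = -1, -math.inf
    for index, frame in enumerate(broadcast_frames):
        score = psnr(highlight_frame, frame)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score
```

Higher PSNR means lower mean squared error, so the frame with the highest score is treated as the aligned position. Re-encoding differences between the highlight reel and the broadcast mean exact matches are unlikely, which is why a similarity score is used rather than equality.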
Topics
- Long-form Video Analysis
- Sports Highlight Detection
- Video Benchmarking
- Multimodal LLMs
- Training-Free Models
- Video Temporal Grounding
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.