VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Summary
VideoNet is a new large-scale dataset and benchmark designed to revitalize domain-specific action recognition for modern vision-language models (VLMs). It features 1,000 distinct actions across 37 domains, aiming to address the current lack of diverse and challenging data that has led to VLMs no longer being evaluated on action recognition. Initial evaluations show significant performance gaps, with Gemini 3.1 Pro achieving 69.9% accuracy in a multiple-choice setting compared to Qwen3-VL-8B's 45.0%. Even in a binary setting where random chance is 50%, Qwen only reaches 59.2%. While some models like Qwen improve with few-shot in-context examples (+7.0%), others like Gemini decline (-4.8%), and these gains are less than human improvement (+13.6%). To further enhance performance, the creators collected a training dataset of nearly 500k video question-answer pairs, which, when used to fine-tune a Molmo2-4B model, surpassed all open-weight 8B models on VideoNet.
Key takeaway
For research scientists developing or evaluating vision-language models, you should integrate VideoNet into your benchmarking process to accurately assess domain-specific action recognition capabilities. The dataset highlights current VLM limitations in exploiting in-context examples and offers a path for fine-tuning, suggesting that focusing on domain-specific training data can significantly improve model performance beyond generalist approaches.
Key insights
VideoNet introduces a large-scale dataset and benchmark to re-evaluate VLM performance on domain-specific action recognition.
Principles
- Domain-specific actions challenge VLMs.
- Few-shot learning varies across VLMs.
Method
VideoNet evaluates VLMs using multiple-choice, binary, and few-shot settings, then provides a large-scale training dataset for fine-tuning models to improve domain-specific action recognition.
In practice
- Use VideoNet for VLM action recognition.
- Fine-tune models with VideoNet data.
Topics
- VideoNet
- Action Recognition
- Vision-Language Models
- Domain-Specific Actions
- Large-Scale Dataset
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.