VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

2026-05-04 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Advanced, quick

Summary

VideoNet is a large-scale dataset and benchmark intended to revive domain-specific action recognition as an evaluation target for modern vision-language models (VLMs). It covers 1,000 distinct actions across 37 domains and addresses the shortage of diverse, challenging data that has led to VLMs no longer being evaluated on action recognition. Initial evaluations show large performance gaps: in a multiple-choice setting, Gemini 3.1 Pro reaches 69.9% accuracy while Qwen3-VL-8B reaches 45.0%. Even in a binary setting, where random chance is 50%, Qwen reaches only 59.2%. Few-shot in-context examples help some models and hurt others: Qwen improves (+7.0%) while Gemini declines (-4.8%), and even Qwen's gain falls short of the human improvement (+13.6%). To improve performance further, the creators collected a training set of nearly 500k video question-answer pairs; a Molmo2-4B model fine-tuned on it surpassed all open-weight 8B models on VideoNet.
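To put the reported figures side by side, the short sketch below recomputes the gaps implied by the summary. Only the numbers come from the summary; the variable names are illustrative, and treating the few-shot changes as percentage points is an assumption, since the summary does not state the underlying baselines.

```python
# Figures as reported in the summary (percent accuracy).
multiple_choice = {"Gemini 3.1 Pro": 69.9, "Qwen3-VL-8B": 45.0}
binary_qwen = 59.2
binary_chance = 50.0

# Few-shot changes, assumed here to be percentage points.
few_shot_delta = {"Qwen": 7.0, "Gemini": -4.8, "Human": 13.6}

# Gap between the two models in the multiple-choice setting.
mc_gap = round(multiple_choice["Gemini 3.1 Pro"] - multiple_choice["Qwen3-VL-8B"], 1)

# How far Qwen sits above random guessing in the binary setting.
binary_margin = round(binary_qwen - binary_chance, 1)

# How much each model's few-shot change trails the human improvement.
shortfall_vs_human = {
    name: round(few_shot_delta["Human"] - delta, 1)
    for name, delta in few_shot_delta.items()
    if name != "Human"
}

print(mc_gap)              # 24.9
print(binary_margin)       # 9.2
print(shortfall_vs_human)  # {'Qwen': 6.6, 'Gemini': 18.4}
```

Read this way, Qwen's binary result is about 9 points above a coin flip, and even the model that benefits from in-context examples gains roughly half of what humans do.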

Key takeaway

Research scientists developing or evaluating vision-language models should add VideoNet to their benchmarking process to assess domain-specific action recognition accurately. The dataset exposes current VLM limitations in exploiting in-context examples and offers a path forward through fine-tuning, suggesting that domain-specific training data can lift performance well beyond what generalist approaches achieve.

Key insights

VideoNet introduces a large-scale dataset and benchmark to re-evaluate VLM performance on domain-specific action recognition.

Principles

Domain-specific actions challenge VLMs.
Few-shot gains vary across VLMs.

Method

VideoNet evaluates VLMs using multiple-choice, binary, and few-shot settings, then provides a large-scale training dataset for fine-tuning models to improve domain-specific action recognition.
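A minimal sketch of how the three evaluation settings could be scored is shown here. It is not VideoNet's actual harness: the `Item` record, the toy labels, and the yes/no framing of the binary setting are all hypothetical, chosen only to illustrate accuracy, chance level, and a few-shot delta.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One hypothetical benchmark question (not VideoNet's actual schema)."""
    options: list[str]  # candidate labels; two options models the binary setting
    answer: str         # ground-truth label
    prediction: str     # label the model chose


def accuracy(items: list[Item]) -> float:
    """Fraction of items whose prediction matches the ground truth."""
    if not items:
        raise ValueError("no items to score")
    return sum(item.prediction == item.answer for item in items) / len(items)


def chance_level(items: list[Item]) -> float:
    """Expected accuracy of uniform random guessing, averaged over items."""
    if not items:
        raise ValueError("no items to score")
    return sum(1 / len(item.options) for item in items) / len(items)


# Toy predictions with made-up labels, purely for illustration.
four_way = ["dribble", "pass", "shoot", "block"]
multiple_choice = [
    Item(four_way, "dribble", "dribble"),
    Item(four_way, "pass", "pass"),
    Item(four_way, "shoot", "block"),
    Item(four_way, "block", "block"),
]
yes_no = ["yes", "no"]
binary = [
    Item(yes_no, "yes", "yes"),
    Item(yes_no, "no", "yes"),
    Item(yes_no, "no", "no"),
    Item(yes_no, "yes", "yes"),
]
# The same binary questions re-asked with in-context examples (hypothetical outcome).
binary_few_shot = [
    Item(yes_no, "yes", "yes"),
    Item(yes_no, "no", "no"),
    Item(yes_no, "no", "no"),
    Item(yes_no, "yes", "yes"),
]

mc_accuracy = accuracy(multiple_choice)    # 3 of 4 correct
mc_chance = chance_level(multiple_choice)  # 1 in 4 options
binary_margin = accuracy(binary) - chance_level(binary)
few_shot_delta = accuracy(binary_few_shot) - accuracy(binary)

print(mc_accuracy, mc_chance, binary_margin, few_shot_delta)
```

Reporting the margin over chance alongside raw accuracy keeps the binary and multiple-choice settings comparable, since their guessing baselines differ.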

In practice

Benchmark VLMs on VideoNet for domain-specific action recognition.
Fine-tune models on VideoNet's training data.

Topics

VideoNet
Action Recognition
Vision-Language Models
Domain-Specific Actions
Large-Scale Dataset

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.