An Analysis Focused on Women's Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset?

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The ExtrAnom dataset is introduced to address the lack of resources for video anomaly detection (VAD) focused on women's safety. Existing VAD datasets are often high-resolution and well-lit, and they fail to represent women-centric anomalies such as chain snatching, stalking, and inappropriate touch, especially in low-light or low-resolution surveillance footage. ExtrAnom comprises 1001 real-world videos (500 normal, 501 anomalous) covering 5 types of women-centric crimes. The collection includes 8% low-light, 13% low-resolution, and 15% long-shot videos. Each video comes with one human-generated and three LLM-generated textual descriptions, enabling cross-modal and VLM-based validations. Benchmarking against popular VAD datasets and SOTA methods shows that existing models perform poorly on women-centric anomalies, which underscores the need for a dataset like ExtrAnom.

Key takeaway

AI scientists and machine learning engineers developing surveillance systems should recognize that current VAD models, including SOTA multi-modal LLMs, perform poorly on women-centric anomalies because of data limitations. Prioritize training and fine-tuning on specialized datasets such as ExtrAnom, which covers diverse real-world conditions and provides detailed textual annotations, to improve accuracy in detecting critical events such as stalking and chain snatching. Doing so should make public safety applications more reliable.

Key insights

Existing VAD models fail to detect women-centric anomalies due to inadequate, unrepresentative training data.

Principles

Real-world surveillance conditions (low-light, low-res) are critical for effective VAD.
Multi-modal datasets with textual annotations enhance VLM performance for fine-grained anomaly detection.
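
One way to check the first principle on your own model is to break anomaly recall down by capture condition rather than reporting a single aggregate score. The following is a minimal sketch, not the paper's evaluation protocol: the condition tags, the record layout, and the function name `recall_by_condition` are assumptions made for illustration, and the toy predictions are invented.

```python
from collections import defaultdict


def recall_by_condition(records):
    """Return anomaly recall per capture condition.

    Each record is a tuple (condition, is_anomalous, predicted_anomalous).
    The condition tags are hypothetical labels, not ExtrAnom field names.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, is_anomalous, predicted in records:
        if not is_anomalous:
            continue  # recall only considers videos that are truly anomalous
        totals[condition] += 1
        if predicted:
            hits[condition] += 1
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Invented predictions purely for illustration (not results from the paper).
toy_records = [
    ("standard", True, True),
    ("standard", True, True),
    ("low_light", True, False),
    ("low_light", True, True),
    ("low_res", True, False),
    ("low_res", False, False),
]

per_condition = recall_by_condition(toy_records)
```

A large gap between the standard group and the low-light or low-resolution groups would suggest that the training data under-represents those conditions.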

Method

ExtrAnom was created by collecting real-world videos from diverse sources, categorizing them into 5 women-centric crime types, and attaching textual annotations to each video: one written by a human and three generated by LLMs (ChatGPT, DeepSeek, Mistral).
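
The annotation scheme described above can be sketched as a small record type with a consistency check. This is an illustration under stated assumptions, not the dataset's actual schema: the class name `AnnotatedVideo`, its field names, the example path, and the description texts are all hypothetical. The category is kept as a free-form string because this summary does not list all five crime types.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

EXPECTED_LLM_COUNT = 3  # the summary states three LLM-generated descriptions per video


@dataclass
class AnnotatedVideo:
    """Hypothetical record layout for one video and its text annotations."""
    path: str
    is_anomalous: bool
    category: Optional[str]           # crime type for anomalous videos, else None
    human_description: str
    llm_descriptions: Dict[str, str]  # keyed by the name of the generating model


def validate(record: AnnotatedVideo) -> List[str]:
    """Return a list of problems found in a record (empty if it looks complete)."""
    problems = []
    if not record.human_description.strip():
        problems.append("missing human description")
    non_empty = [text for text in record.llm_descriptions.values() if text.strip()]
    if len(non_empty) != EXPECTED_LLM_COUNT:
        problems.append(
            f"expected {EXPECTED_LLM_COUNT} LLM descriptions, found {len(non_empty)}"
        )
    if record.is_anomalous and not record.category:
        problems.append("anomalous video has no category")
    return problems


# Invented example values; the path and texts are placeholders.
complete = AnnotatedVideo(
    path="videos/example_0001.mp4",
    is_anomalous=True,
    category="chain snatching",
    human_description="A rider grabs a necklace from a pedestrian.",
    llm_descriptions={
        "ChatGPT": "A person on a motorbike snatches a chain.",
        "DeepSeek": "A motorcyclist pulls a necklace off a woman.",
        "Mistral": "A chain is snatched by a passing rider.",
    },
)
incomplete = AnnotatedVideo(
    path="videos/example_0002.mp4",
    is_anomalous=True,
    category=None,
    human_description="A person follows a woman along a street.",
    llm_descriptions={"ChatGPT": "Someone is followed at night."},
)
```

Running `validate` over every record before training is a cheap way to catch videos whose annotations are incomplete.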

In practice

Use ExtrAnom to train VLMs for detecting subtle women-centric crimes.
Incorporate low-light and low-resolution video data for robust VAD model development.
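
When low-light and low-resolution clips make up a small share of the training set, one common option is to oversample them so each capture condition is seen equally often. The source does not prescribe a sampling strategy, so the following is only a sketch: the equal-share target, the condition tags, and the function name `balanced_sampling_weights` are assumptions.

```python
from collections import Counter


def balanced_sampling_weights(conditions):
    """Return one sampling weight per clip so every condition gets an equal share.

    Each clip is weighted by the inverse frequency of its condition tag,
    scaled so that all weights sum to 1.
    """
    counts = Counter(conditions)
    n_groups = len(counts)
    return [1.0 / (n_groups * counts[condition]) for condition in conditions]


# Hypothetical condition tags for ten clips; "standard" dominates the set.
clip_conditions = ["standard"] * 6 + ["low_light"] * 2 + ["low_res"] * 2
weights = balanced_sampling_weights(clip_conditions)

# Total probability mass assigned to the low-light clips.
low_light_share = sum(
    w for w, condition in zip(weights, clip_conditions) if condition == "low_light"
)
```

The resulting list can be passed as the `weights` argument of `random.choices` to draw training batches in which the rarer conditions are no longer drowned out by standard footage.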

Topics

Video Anomaly Detection
Women's Safety
Multi-modal LLMs
ExtrAnom Dataset
Surveillance Videos
Vision Language Models

Code references

A24CS09005/ExtrAnom

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.