PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning
Summary
PaSBench-Video is a new benchmark for evaluating video-capable multimodal large language models (MLLMs) as proactive safety monitors. It targets three limitations of existing benchmarks: they use static inputs, ignore timing, and omit false-positive measurements. PaSBench-Video comprises 740 videos (481 risk and 259 no-risk) across four domains: driving, healthcare, daily life, and industrial production. Risk videos carry frame-level annotations for risk onset and accident boundaries, and models must observe each video causally and issue warnings that are both temporally calibrated and content-correct. Across the 13 MLLMs tested, no model exceeded 20.0% on the strictest metric, and recall was tightly coupled to false-positive rate (Pearson correlation 0.64). Performance varied significantly by domain: models showed moderate recall at low false-positive rates in daily life scenarios but fired indiscriminately in driving contexts, suggesting that current models rely on scene-level activity cues rather than genuine reasoning about emerging harm.
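The coupling between recall and false-positive rate can be made concrete with a short sketch. It assumes one recall value and one false-positive rate per model, correlated across models; the summary does not say exactly how the benchmark aggregates these, so the function names and the toy numbers below are hypothetical and are not benchmark results.

```python
import math
from typing import Sequence


def recall(correct_warnings: int, n_risk_videos: int) -> float:
    """Fraction of risk videos on which the model issued a correct warning."""
    return correct_warnings / n_risk_videos


def false_positive_rate(false_alarms: int, n_no_risk_videos: int) -> float:
    """Fraction of no-risk videos on which the model still issued a warning."""
    return false_alarms / n_no_risk_videos


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up per-model counts (correct warnings, false alarms), for illustration only.
    toy_counts = [(60, 20), (150, 90), (240, 130), (300, 210)]
    recalls = [recall(hits, 481) for hits, _ in toy_counts]
    fprs = [false_positive_rate(alarms, 259) for _, alarms in toy_counts]
    print(f"Pearson r between recall and FPR: {pearson(recalls, fprs):.2f}")
```

A positive correlation in this setup means that models which catch more risks also raise more false alarms, which is the pattern the summary describes.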
Key takeaway
For AI engineers building proactive safety systems on MLLMs: these results indicate that current models are not ready for real-world deployment as safety monitors. Shift your focus from activity detection to genuine reasoning about emerging harm, and prioritize temporal precision and much lower false-positive rates. Invest in research on domain-specific challenges, especially in environments like driving where hazardous and routine scenes look similar, to build reliable warning systems.
Key insights
Current MLLMs fail to provide proactive, temporally precise safety warnings, struggling with false positives and domain-specific nuances.
Principles
- Proactive safety requires temporal precision and false-positive control.
- MLLM performance varies significantly across safety domains.
- Current MLLMs detect activity, not emerging harm.
Method
PaSBench-Video evaluates MLLMs on 740 videos (481 risk, 259 no-risk). Models must observe each video causally and issue temporally calibrated, content-correct warnings; risk videos carry frame-level annotations for risk onset and accident boundaries.
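To illustrate how frame-level annotations could be used to judge a warning, here is a minimal sketch. It assumes a warning counts as timely when it fires at or after the annotated risk onset and before the accident begins; the summary does not spell out the benchmark's actual scoring rule, so the window definition and all names (`RiskVideo`, `warning_is_timely`, `strict_hit`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskVideo:
    onset_frame: int     # annotated frame at which the risk first emerges
    accident_frame: int  # annotated frame at which the accident begins


def warning_is_timely(video: RiskVideo, warning_frame: Optional[int]) -> bool:
    """Assumed rule: the warning must land in [onset_frame, accident_frame)."""
    if warning_frame is None:  # the model never issued a warning
        return False
    return video.onset_frame <= warning_frame < video.accident_frame


def strict_hit(video: RiskVideo, warning_frame: Optional[int], content_correct: bool) -> bool:
    """A warning counts only if it is both timely and describes the right risk."""
    return warning_is_timely(video, warning_frame) and content_correct
```

Under this assumed rule, a warning issued before the risk is visible or after the accident has started does not count, even when its content names the correct hazard.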
In practice
- Develop MLLMs that reason about harm, not just activity.
- Prioritize false-positive reduction in safety-critical MLLMs.
- Tailor MLLM safety solutions to specific domain characteristics.
Topics
- PaSBench-Video
- Multimodal LLMs
- Proactive Safety
- Video Benchmarking
- False Positive Rate
- Temporal Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.