MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models
Summary
MCBench is a new multicontext safety assessment benchmark for Omni Large Language Models (Omni LLMs), which process vision, audio, and text simultaneously. To address the limitations of existing visual-only and general reasoning benchmarks, MCBench features 1196 scenarios across four safety categories: physical harm, social harm, illegal harm, and property damage. Each unsafe scenario is paired with a minimally different safe counterpart, so the benchmark measures how sensitive a model is to the change that separates the two. The evaluated state-of-the-art models, including Gemini-Flash-2.5 and Qwen-Omni-2.5-3B, averaged approximately 64.5% accuracy. Omni LLMs perform better when a scenario contains salient visual or acoustic cues and struggle with subtle or non-physical risks such as social and illegal harm. The analysis shows that models can extract modality-specific information but often fail to integrate those cues, which leads to oversensitivity and false positives on safe scenarios.
Key takeaway
AI Security Engineers deploying Omni LLMs in safety-critical applications should recognize the limits of current models' multicontext reasoning. Extend evaluation beyond visual-only benchmarks to scenarios that require integrated vision, audio, and speech analysis. Prioritize models and training strategies that demonstrate robust cross-modal information aggregation, especially for subtle social or legal risks, to reduce oversensitivity and false positives in safe situations.
Key insights
Current Omni LLMs lack robust cross-modal reasoning for safety, struggling to integrate diverse sensory cues effectively.
Principles
- Multimodal safety requires cross-modal integration.
- Subtle risks challenge Omni LLMs more than salient cues.
- Oversensitivity to single cues causes false positives.
Method
MCBench constructs 1196 unsafe-safe scenario pairs across four safety categories. Claude-Sonnet-4.5 generates the scenarios, Gemini-Flash-2.5 and Stable Audio 1.0 synthesize the multimodal content, and human experts refine the results.
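As a rough illustration of the pairing idea, the sketch below models one unsafe-safe pair as a small data structure. The class names, field names, file names, and the example scenario are all hypothetical and do not come from the benchmark; the four categories are the ones named in the summary.

```python
from dataclasses import dataclass, fields
from enum import Enum


class HarmCategory(Enum):
    # The four safety categories named in the summary.
    PHYSICAL = "physical harm"
    SOCIAL = "social harm"
    ILLEGAL = "illegal harm"
    PROPERTY = "property damage"


@dataclass(frozen=True)
class Scenario:
    # Hypothetical fields: one slot per context a model would receive.
    image_path: str
    audio_path: str
    speech_text: str


@dataclass(frozen=True)
class ScenarioPair:
    category: HarmCategory
    unsafe: Scenario
    safe: Scenario

    def differing_fields(self):
        """Return the names of the contexts that differ between the two scenarios."""
        return [
            f.name
            for f in fields(Scenario)
            if getattr(self.unsafe, f.name) != getattr(self.safe, f.name)
        ]


# Invented example: only the background audio separates unsafe from safe.
pair = ScenarioPair(
    category=HarmCategory.PHYSICAL,
    unsafe=Scenario("kitchen.png", "gas_hiss.wav", "Can I light the stove now?"),
    safe=Scenario("kitchen.png", "quiet_room.wav", "Can I light the stove now?"),
)
print(pair.differing_fields())  # ['audio_path']
```

A minimally different pair should differ in as few contexts as possible, so a check like `differing_fields` is one way to confirm that the safe counterpart changes only the cue under test.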
In practice
- Evaluate Omni LLMs with multicontext safety benchmarks.
- Focus training on cross-modal information aggregation.
- Develop architectures for balanced multicontext reasoning.
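The first recommendation can be sketched as a small scoring routine. This is a minimal sketch, assuming each evaluated scenario yields a record of (category, ground-truth unsafe flag, model-predicted unsafe flag); the function name, the record layout, and the toy records are assumptions for illustration, not the benchmark's own tooling or results.

```python
from collections import defaultdict


def safety_metrics(records):
    """Score (category, is_unsafe, predicted_unsafe) records.

    Returns overall accuracy, the false positive rate on safe scenarios
    (safe scenarios flagged as unsafe), and accuracy per category.
    """
    total = correct = safe_total = false_pos = 0
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, is_unsafe, predicted_unsafe in records:
        hit = is_unsafe == predicted_unsafe
        total += 1
        correct += hit
        per_cat[category][0] += hit
        per_cat[category][1] += 1
        if not is_unsafe:
            safe_total += 1
            false_pos += predicted_unsafe
    return {
        "accuracy": correct / total if total else 0.0,
        "false_positive_rate": false_pos / safe_total if safe_total else 0.0,
        "per_category_accuracy": {c: k / n for c, (k, n) in per_cat.items()},
    }


# Toy records, invented for illustration only.
records = [
    ("physical harm", True, True),    # unsafe, correctly flagged
    ("physical harm", False, False),  # safe, correctly passed
    ("social harm", True, False),     # unsafe, missed
    ("social harm", False, True),     # safe, wrongly flagged (false positive)
]
metrics = safety_metrics(records)
print(metrics["accuracy"])             # 0.5
print(metrics["false_positive_rate"])  # 0.5
```

Reporting the false positive rate alongside accuracy matters here because an oversensitive model can score well on unsafe scenarios while wrongly flagging their safe counterparts.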
Topics
- Omni LLMs
- Multimodal Safety
- MCBench
- Cross-modal Reasoning
- Safety Benchmarking
- Multimodal AI Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.