MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

MCBench is a new multicontext safety assessment benchmark designed for Omni Large Language Models (LLMs) that process vision, audio, and text simultaneously. Addressing limitations of existing visual-only or general reasoning benchmarks, MCBench features 1196 scenarios across four safety categories: physical harm, social harm, illegal harm, and property damage. Each unsafe scenario is paired with a minimally different safe counterpart to evaluate model sensitivity. Evaluations of state-of-the-art models, including Gemini-Flash-2.5 and Qwen-Omni-2.5-3B, revealed an average accuracy of approximately 64.5%. Findings indicate Omni LLMs struggle with subtle or non-physical risks like social and illegal harm, performing better with salient visual or acoustic cues. Analysis shows models can extract modality-specific information but often fail to integrate these cues effectively, leading to oversensitivity and false positives on safe scenarios.

Key takeaway

If you are an AI Security Engineer deploying Omni LLMs in safety-critical applications, recognize that current models are limited in multicontext reasoning. Extend your evaluation beyond visual-only benchmarks to scenarios that require integrated vision, audio, and speech analysis. Prioritize models and training strategies that demonstrate robust cross-modal information aggregation, especially for subtle social or legal risks, to reduce oversensitivity and false positives in safe situations.

Key insights

Current Omni LLMs lack robust cross-modal reasoning for safety, struggling to integrate diverse sensory cues effectively.

Principles

Multimodal safety requires cross-modal integration.
Subtle risks challenge Omni LLMs more than salient cues.
Oversensitivity to single cues causes false positives.

Method

MCBench constructs 1196 unsafe-safe scenario pairs across four safety categories. Claude-Sonnet-4.5 generates the scenarios, Gemini-Flash-2.5 and Stable Audio 1.0 synthesize the multimodal content, and human experts refine the results.
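
The unsafe-safe pairing can be pictured as a small data structure. The sketch below is illustrative only: the field names (scenario_id, image_path, audio_path, text), the ScenarioPair class, and the predict_unsafe callback are assumptions made for this example, not the schema or API released with MCBench.

```python
from dataclasses import dataclass

# Category names follow the summary; the record layout itself is hypothetical.
CATEGORIES = ("physical harm", "social harm", "illegal harm", "property damage")


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    category: str
    image_path: str   # visual context
    audio_path: str   # acoustic context
    text: str         # textual or spoken prompt
    is_unsafe: bool   # ground-truth label


@dataclass(frozen=True)
class ScenarioPair:
    unsafe: Scenario
    safe: Scenario    # minimally different safe counterpart

    def __post_init__(self):
        if not self.unsafe.is_unsafe or self.safe.is_unsafe:
            raise ValueError("pair must hold one unsafe and one safe scenario")
        if self.unsafe.category != self.safe.category:
            raise ValueError("paired scenarios must share a category")
        if self.unsafe.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.unsafe.category}")


def pair_is_discriminated(pair, predict_unsafe):
    """True only if the model flags the unsafe variant and clears the safe one."""
    return bool(predict_unsafe(pair.unsafe)) and not bool(predict_unsafe(pair.safe))
```

A model that labels everything unsafe fails every pair under this check, which is one way to surface the oversensitivity described in the summary.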

In practice

Evaluate Omni LLMs with multicontext safety benchmarks.
Focus training on cross-modal information aggregation.
Develop architectures for balanced multicontext reasoning.
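
As a rough illustration of the first recommendation, the sketch below scores a model's safe/unsafe judgments by overall accuracy, per-category accuracy, and false positive rate on safe scenarios. The function name safety_metrics and the (category, is_unsafe, predicted_unsafe) record format are assumptions for this example, not MCBench's evaluation code.

```python
from collections import defaultdict


def safety_metrics(records):
    """Score binary safety judgments.

    records: iterable of (category, is_unsafe, predicted_unsafe) tuples,
    where the last two fields are booleans (hypothetical format).
    """
    total = correct = 0
    safe_total = false_pos = 0
    per_cat = defaultdict(lambda: [0, 0])  # category -> [correct, total]

    for category, is_unsafe, predicted_unsafe in records:
        hit = bool(is_unsafe) == bool(predicted_unsafe)
        total += 1
        correct += hit
        per_cat[category][0] += hit
        per_cat[category][1] += 1
        if not is_unsafe:
            # A safe scenario flagged as unsafe counts as a false positive.
            safe_total += 1
            false_pos += bool(predicted_unsafe)

    return {
        "accuracy": correct / total if total else 0.0,
        "false_positive_rate": false_pos / safe_total if safe_total else 0.0,
        "per_category_accuracy": {c: k / n for c, (k, n) in per_cat.items()},
    }
```

Reporting the false positive rate next to accuracy keeps an oversensitive model from looking acceptable on the unsafe half of the benchmark alone.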

Topics

Omni LLMs
Multimodal Safety
MCBench
Cross-modal Reasoning
Safety Benchmarking
Multimodal AI Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.