SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

2026-04-21 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

SafetyALFRED is a new benchmark for evaluating the safety-conscious planning abilities of Multimodal Large Language Models (MLLMs) operating as autonomous agents in interactive environments. Built on the existing ALFRED benchmark, it adds six categories of real-world kitchen hazards. Traditional safety evaluations focus on hazard recognition via disembodied question answering (QA); SafetyALFRED instead assesses eleven state-of-the-art MLLMs from the Qwen, Gemma, and Gemini families on both hazard recognition and active risk mitigation through embodied planning. Experimental results indicate a substantial alignment gap: models recognize hazards with high accuracy in QA settings but achieve low average success rates when mitigating the same hazards in embodied contexts. The gap shows that static QA evaluations are inadequate for assessing physical safety.

Key takeaway

If you are a research scientist developing or deploying MLLMs as autonomous agents, prioritize embodied planning benchmarks like SafetyALFRED over traditional QA evaluations. Your models' ability to recognize hazards does not guarantee that they can mitigate risks in physical environments, so evaluation needs to shift toward embodied settings to establish real-world safety and reliability.

Key insights

MLLMs recognize hazards in QA but struggle with embodied mitigation, revealing a critical safety alignment gap.

Principles

Static QA is insufficient for assessing physical safety.
Embodied planning is crucial for risk mitigation.

Method

SafetyALFRED augments the ALFRED benchmark with six kitchen hazard categories to evaluate MLLMs on both hazard recognition and active risk mitigation in embodied planning scenarios.
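
The comparison between recognition and mitigation can be made concrete with a small scoring sketch. The code below is illustrative only and is not taken from the SafetyALFRED repository: the `EpisodeResult` record, the `alignment_gap` helper, the placeholder category names, and the toy data are all assumptions introduced here. It assumes each evaluated hazard instance yields one boolean for QA recognition and one for embodied mitigation, and it reports the difference between the two rates per category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """One evaluated hazard instance (hypothetical record layout)."""
    category: str     # hazard category label; placeholder names below
    recognized: bool  # QA setting: the model identified the hazard
    mitigated: bool   # embodied setting: the model's plan resolved the hazard


def alignment_gap(results):
    """Return {category: (recognition_rate, mitigation_rate, gap)}.

    gap = recognition_rate - mitigation_rate; a large positive value means
    the model recognizes hazards it does not go on to mitigate.
    """
    by_category = {}
    for r in results:
        by_category.setdefault(r.category, []).append(r)

    summary = {}
    for category, rows in by_category.items():
        n = len(rows)
        recognition = sum(r.recognized for r in rows) / n
        mitigation = sum(r.mitigated for r in rows) / n
        summary[category] = (recognition, mitigation, recognition - mitigation)
    return summary


# Toy data for illustration only; these are NOT results from the paper.
toy_results = [
    EpisodeResult("hazard_a", recognized=True, mitigated=False),
    EpisodeResult("hazard_a", recognized=True, mitigated=True),
    EpisodeResult("hazard_b", recognized=True, mitigated=False),
    EpisodeResult("hazard_b", recognized=False, mitigated=False),
]

summary = alignment_gap(toy_results)
for category, (rec, mit, gap) in summary.items():
    print(f"{category}: recognition={rec:.2f} mitigation={mit:.2f} gap={gap:.2f}")
```

Reporting both rates side by side, rather than a single safety score, keeps the recognition result from masking a weak mitigation result, which is the failure mode the benchmark is meant to expose.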

In practice

Evaluate MLLMs beyond QA for safety.
Focus on embodied planning for agent safety.

Topics

Multimodal Large Language Models
SafetyALFRED Benchmark
Embodied Planning
Hazard Recognition
Risk Mitigation

Code references

sled-group/SafetyALFRED

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.