SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
Summary
SafetyALFRED is a benchmark designed to evaluate the safety-conscious planning abilities of Multimodal Large Language Models (MLLMs) operating as autonomous agents in interactive environments. Built on the existing ALFRED benchmark, SafetyALFRED incorporates six categories of real-world kitchen hazards. Unlike traditional safety evaluations, which focus on hazard recognition via disembodied question answering (QA), this benchmark assesses eleven state-of-the-art MLLMs from the Qwen, Gemma, and Gemini families on both hazard recognition and active risk mitigation through embodied planning. Experimental results indicate a substantial alignment gap: models recognize hazards with high accuracy in QA settings but achieve low average success rates when mitigating the same hazards in embodied contexts. This gap shows that static QA evaluations are inadequate for assessing physical safety.
Key takeaway
Research scientists developing or deploying MLLMs as autonomous agents should prioritize embodied planning benchmarks like SafetyALFRED over traditional QA evaluations. A model's ability to recognize hazards does not guarantee that it can mitigate risks in physical environments, so evaluation strategies need to shift toward embodied settings to ensure real-world safety and reliability.
Key insights
MLLMs recognize hazards in QA but struggle with embodied mitigation, revealing a critical safety alignment gap.
Principles
- Static QA is insufficient for physical safety.
- Embodied planning is crucial for risk mitigation.
Method
SafetyALFRED augments the ALFRED benchmark with six kitchen hazard categories to evaluate MLLMs on both hazard recognition and active risk mitigation in embodied planning scenarios.
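The comparison between recognition and mitigation can be made concrete with a small sketch. This is not the benchmark's evaluation code: the `EpisodeResult` record, the `alignment_gap` function, the hazard labels, and the toy values are all hypothetical, chosen only to illustrate how a QA accuracy and an embodied mitigation success rate might be compared on the same set of hazards.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One hazard instance scored in both settings (hypothetical schema)."""
    hazard_category: str      # placeholder label, not a SafetyALFRED category name
    recognized_in_qa: bool    # model identified the hazard in the QA setting
    mitigated_in_plan: bool   # model's embodied plan actually mitigated the hazard


def alignment_gap(results):
    """Return (QA accuracy, mitigation success rate, gap between them)."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    qa_accuracy = sum(r.recognized_in_qa for r in results) / n
    mitigation_rate = sum(r.mitigated_in_plan for r in results) / n
    return qa_accuracy, mitigation_rate, qa_accuracy - mitigation_rate


# Toy values for illustration only; these are not results from the paper.
toy_results = [
    EpisodeResult("hazard_a", True, False),
    EpisodeResult("hazard_a", True, True),
    EpisodeResult("hazard_b", True, False),
    EpisodeResult("hazard_b", False, False),
]

qa_accuracy, mitigation_rate, gap = alignment_gap(toy_results)
print(f"QA accuracy: {qa_accuracy:.2f}")
print(f"Mitigation success rate: {mitigation_rate:.2f}")
print(f"Gap: {gap:.2f}")
```

Scoring both settings on the same hazard instances keeps the two rates directly comparable; a large positive gap means the model recognizes hazards it then fails to act on.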
In practice
- Evaluate MLLM safety with embodied tasks, not QA alone.
- Measure whether agents actually mitigate hazards during planning, not only whether they recognize them.
Topics
- Multimodal Large Language Models
- SafetyALFRED Benchmark
- Embodied Planning
- Hazard Recognition
- Risk Mitigation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.