When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Researchers have introduced LongAct, a new benchmark designed to evaluate planning-level autonomy in long-horizon household tasks for embodied AI. LongAct uses free-form instructions and abstracts away low-level control to focus on high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. Alongside the benchmark, they propose HoloMind, a Vision-Language Model (VLM)-driven agent featuring a Directed Acyclic Graph (DAG)-based hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments using GPT-5 and Qwen3-VL models demonstrate that HoloMind significantly enhances long-horizon performance, reducing dependence on model scale. Despite these advancements, top models achieved only 59% goal completion and 16% full-task success on LongAct, highlighting the benchmark's difficulty and the critical need for more robust long-horizon planning in embodied agents.
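The gap between 59% goal completion and 16% full-task success is easier to read with a small worked example. The sketch below assumes definitions that this summary does not spell out: goal completion is taken as the per-task fraction of satisfied goal conditions, averaged over tasks, and full-task success as the fraction of tasks in which every goal condition is satisfied. The function name `score` and the sample episodes are hypothetical and are not taken from LongAct.

```python
def score(episodes):
    """Return (goal_completion, full_task_success) for a list of episodes.

    Each episode is a non-empty list of booleans, one per goal condition.
    ASSUMPTION: these metric definitions are illustrative, not the paper's.
    """
    n = len(episodes)
    # Average, over tasks, of the share of goal conditions that were met.
    goal_completion = sum(sum(goals) / len(goals) for goals in episodes) / n
    # Share of tasks in which every single goal condition was met.
    full_task_success = sum(all(goals) for goals in episodes) / n
    return goal_completion, full_task_success


# Hypothetical results for four long-horizon tasks with four goals each.
episodes = [
    [True, True, True, True],     # fully solved
    [True, False, True, False],   # half of the goals met
    [False, False, True, True],   # half of the goals met
    [True, True, False, False],   # half of the goals met
]

goal_completion, full_task_success = score(episodes)
print(goal_completion, full_task_success)  # 0.625 0.25
```

Under these assumed definitions, an agent can satisfy most individual goals yet still fail most tasks outright, because a single missed dependency anywhere in a long task rules out full success.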

Key takeaway

For research scientists developing embodied AI agents, the LongAct benchmark reveals significant gaps in current long-horizon planning capabilities. You should prioritize developing more sophisticated hierarchical planning, robust memory systems, and reflective supervision mechanisms to improve task completion rates. The HoloMind agent's architecture offers a strong starting point for designing agents capable of handling complex, multi-step household tasks, moving beyond short-horizon navigation and manipulation.

Key insights

LongAct and HoloMind advance embodied AI by benchmarking and improving long-horizon household task execution.

Method

HoloMind employs a DAG-based hierarchical planner, Multimodal Spatial Memory, Episodic Memory, and a global Critic for VLM-driven long-horizon task execution.
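To make the idea of DAG-based planning concrete, here is a minimal sketch of ordering subtasks by their dependencies. It is not HoloMind's planner: the subtask names, the `plan_in_waves` helper, and the grouping into "waves" are assumptions made for illustration, and the VLM, memory, and Critic components are left out entirely. It relies only on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical subtask graph for "make and serve tea".
# Each key maps to the set of subtasks that must finish before it can start.
deps = {
    "find_mug": set(),
    "wash_mug": {"find_mug"},
    "boil_water": set(),
    "make_tea": {"wash_mug", "boil_water"},
    "serve_tea": {"make_tea"},
}


def plan_in_waves(dependencies):
    """Group subtasks into waves whose prerequisites are all complete."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic order
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished, unlocking successors
    return waves


waves = plan_in_waves(deps)
for i, wave in enumerate(waves, start=1):
    print(i, wave)
```

Representing the plan as a graph rather than a flat list keeps independent subtasks (here, finding the mug and boiling water) unordered relative to each other, so an agent could reorder or retry them without invalidating the rest of the plan.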

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.