HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

HomeWorld is a unified hierarchical framework for generating controllable, densely interactive whole-home scenes from natural-language prompts. Developed by Ace Robotics, CUHK MMLab, and Shenzhen Loop Area Institute, it addresses the scarcity of 3D scene data by decomposing synthesis into stages. The authors curate a dataset of 300K real residential floorplans and use it to train an LLM that generates fine-grained floorplans in a K-D tree representation. On top of each floorplan, HomeWorld uses image generation models to lay out furniture from multi-level roaming viewpoints, then places small manipulable objects on supporting surfaces. A VLM-based refiner iteratively corrects placements, and a 3D generative model allows assets to be swapped flexibly. The pipeline then adds physical attributes, textures, and lighting so that scenes are ready for embodied AI simulation. The authors report that, in their experiments, HomeWorld outperforms the compared baselines on layout diversity and 3D design appeal. The project plans to release its 300K floorplan dataset and 5K fully furnished scenes.

Key takeaway

For AI scientists and machine learning engineers building embodied AI agents, HomeWorld offers a way to generate high-fidelity, interactive whole-home scenes rather than isolated rooms. Its unified hierarchical framework and forthcoming dataset could reduce the manual effort of building simulation environments for complex tasks, so the approach is worth evaluating if you need more realistic and functionally plausible scenes.

Key insights

Unified hierarchical generation combining 2D priors with 3D grounding creates diverse, simulation-ready whole-home scenes.

Principles

Hierarchical decomposition simplifies complex scene generation.
2D generative models provide strong, scalable semantic priors.
Iterative refinement ensures physical and semantic consistency.
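
The third principle can be illustrated with a small check-and-correct loop. This is only a minimal sketch, not HomeWorld's refiner: a rule-based overlap check stands in for the VLM, furniture is reduced to 2D boxes `(x, y, width, depth)`, and every name here (`overlaps`, `find_conflict`, `refine`, `max_rounds`) is hypothetical.

```python
def overlaps(a, b):
    """True if two axis-aligned boxes (x, y, w, d) share interior area."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad


def find_conflict(layout):
    """Stand-in for the critic: return the first overlapping pair, or None."""
    names = sorted(layout)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if overlaps(layout[first], layout[second]):
                return first, second
    return None


def refine(layout, max_rounds=10):
    """Apply one correction per round until the critic finds no conflict.

    Returns the corrected layout and the number of corrections applied.
    """
    layout = dict(layout)  # leave the caller's layout untouched
    for round_index in range(max_rounds):
        conflict = find_conflict(layout)
        if conflict is None:
            return layout, round_index
        fixed, moved = conflict
        fx, _, fw, _ = layout[fixed]
        _, my, mw, md = layout[moved]
        # Toy correction: slide the second box to the right edge of the first.
        layout[moved] = (fx + fw, my, mw, md)
    return layout, max_rounds


initial = {
    "sofa": (0.0, 0.0, 2.0, 1.0),
    "table": (1.0, 0.0, 1.0, 1.0),
    "lamp": (2.5, 0.0, 1.0, 1.0),
}
refined, rounds = refine(initial)
```

In this toy run the first correction moves the table into the lamp, so a second round is needed, which is the reason for looping rather than correcting once. The `max_rounds` cap keeps the loop from running forever when corrections keep creating new conflicts.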

Method

An LLM generates floorplans in a K-D tree representation. Hierarchical roaming over top-down and ego-centric viewpoints then places furniture via image inpainting, and a VLM refiner iteratively corrects the resulting layouts. Surface-centric placement adds small manipulable objects on supporting surfaces, and a final stage assigns physical attributes, textures, and lighting.

In practice

Use K-D tree representation for structured floorplan generation.
Employ VLM-based refiners for iterative layout correction.
Integrate 2D image models for diverse object placement.
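
As a rough illustration of the first suggestion, a floorplan can be encoded as a K-D tree-style binary partition: each internal node splits a rectangle along one axis and each leaf is a room. The summary does not give HomeWorld's actual encoding, so the node types (`Room`, `Split`), the `ratio` field, and the `layout` function below are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Room:
    name: str


@dataclass
class Split:
    axis: str      # "x" splits left/right, "y" splits bottom/top
    ratio: float   # fraction of the parent extent given to the first child
    first: "Node"
    second: "Node"


Node = Union[Room, Split]


def layout(node, x, y, w, h):
    """Return {room name: (x, y, w, h)} for the rectangle at (x, y) of size w x h."""
    if isinstance(node, Room):
        return {node.name: (x, y, w, h)}
    if node.axis == "x":
        w1 = w * node.ratio
        rooms = layout(node.first, x, y, w1, h)
        rooms.update(layout(node.second, x + w1, y, w - w1, h))
    else:
        h1 = h * node.ratio
        rooms = layout(node.first, x, y, w, h1)
        rooms.update(layout(node.second, x, y + h1, w, h - h1))
    return rooms


# A 10 m x 8 m home: living room on the left, bedroom and kitchen on the right.
plan = Split("x", 0.5, Room("living"),
             Split("y", 0.5, Room("bedroom"), Room("kitchen")))
rooms = layout(plan, 0.0, 0.0, 10.0, 8.0)
```

A tree like this serializes to a short nested string, which is one plausible reason such a representation suits LLM generation: rooms tile the footprint by construction, so they cannot overlap or leave gaps.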

Topics

Indoor Scene Generation
Embodied AI Simulation
Large Language Models
3D Scene Synthesis
K-D Tree Representation
Vision-Language Models

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.