Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Gaming & Interactive Media · Depth: Expert, extended

Summary

A Global-Local Monte Carlo Tree Search (MCTS) method, guided by a Progress Reward Model (PRM), improves text-to-3D indoor scene generation with Large Vision-Language Models (LVLMs). Existing LVLM methods place objects through sequential decisions, so an early mistake propagates to everything placed after it. This approach instead models scene generation as a tree search problem constrained by spatial commonsense. It uses a hierarchical scene representation that abstracts a scene into room, region, floor object, and supported object levels, with global and local trees handling object placement and parameter determination. The PRM scores intermediate states so that MCTS can prune unpromising branches and balance exploration against exploitation. The work also introduces "3DTindo-bench," a large-scale benchmark with 65 scene types and 3,250 instructions. In the reported experiments, the method generates more realistic 3D scenes and surpasses state-of-the-art approaches by approximately 14% on average performance scores. Source code and dataset are open-sourced.

Key takeaway

For AI scientists and ML engineers building 3D scene generation systems: sequential LVLM approaches are prone to error propagation. A PRM-guided MCTS framework with hierarchical scene decomposition improves realism and lets the system correct earlier errors. Consider tree-search strategies and visual rendering for spatial reasoning to overcome the limitations of chain-of-thought models, especially for complex, constrained generative tasks.

Key insights

PRM-guided MCTS with a hierarchical scene representation enables LVLMs to generate realistic 3D indoor scenes by correcting earlier placement errors and balancing exploration and exploitation during search.

Principles

Decompose complex 3D scene generation into hierarchical levels.
Tree search allows backtracking to correct early placement errors.
Intermediate state evaluation prunes inefficient search paths.
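
The first principle above can be made concrete with a small data structure. The following is a minimal sketch of the four levels named in the summary (room, region, floor object, supported object). All class names, field names, and the metre-based coordinates are illustrative assumptions, not the schema used in the paper or its repository.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SupportedObject:
    """Item resting on a floor object, e.g. a lamp on a nightstand."""
    name: str


@dataclass
class FloorObject:
    """Item standing on the floor; position is a hypothetical (x, y) in metres."""
    name: str
    position: Tuple[float, float]
    supported: List[SupportedObject] = field(default_factory=list)


@dataclass
class Region:
    """Functional zone of a room, e.g. a sleeping area."""
    name: str
    floor_objects: List[FloorObject] = field(default_factory=list)


@dataclass
class Room:
    """Root of the hierarchy; size is a hypothetical (width, depth) in metres."""
    name: str
    size: Tuple[float, float]
    regions: List[Region] = field(default_factory=list)

    def count_objects(self) -> int:
        """Count floor objects plus the objects they support, over all regions."""
        return sum(
            1 + len(obj.supported)
            for region in self.regions
            for obj in region.floor_objects
        )


# Example: one region holding a bed and a nightstand that supports a lamp.
bedroom = Room(
    name="bedroom",
    size=(4.0, 3.5),
    regions=[
        Region(
            name="sleeping area",
            floor_objects=[
                FloorObject("bed", (2.0, 1.0)),
                FloorObject("nightstand", (3.2, 0.4), [SupportedObject("lamp")]),
            ],
        )
    ],
)
```

Nesting the levels this way means a search procedure can commit to coarse decisions (which regions exist) before fine ones (where a lamp sits), which is one plausible reading of why the decomposition helps.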

Method

A hierarchical scene representation (room, region, floor object, and supported object levels) structures a PRM-guided MCTS algorithm, which uses global and local trees for object placement and parameter determination. Diffusion-based re-texturing follows.
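
To illustrate how reward-guided tree search differs from a single sequential pass, here is a toy sketch of MCTS that places three objects on a 3x3 grid. It is not the paper's implementation: it uses one tree rather than separate global and local trees, the standard UCT selection rule is an assumption, and `progress_reward` is a hand-written stand-in for a learned PRM. All names and constants are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple

Placement = Tuple[Tuple[int, int], ...]  # one grid cell per object placed so far

GRID = 3        # toy 3x3 room (assumption)
N_OBJECTS = 3   # objects are placed one at a time in a fixed order


def progress_reward(state: Placement) -> float:
    """Stand-in for a PRM: 0.0 if objects overlap, else higher when spread out."""
    if len(set(state)) < len(state):
        return 0.0
    if len(state) < 2:
        return 1.0
    dists = [abs(a[0] - b[0]) + abs(a[1] - b[1])
             for i, a in enumerate(state) for b in state[i + 1:]]
    return min(dists) / (2 * (GRID - 1))


class Node:
    def __init__(self, state: Placement, parent: Optional["Node"] = None):
        self.state = state
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value = 0.0
        self.expanded = False

    def is_terminal(self) -> bool:
        return len(self.state) == N_OBJECTS

    def expand(self, prune_below: float) -> None:
        """Add one child per cell, skipping states the reward model rejects."""
        self.expanded = True
        for x in range(GRID):
            for y in range(GRID):
                child_state = self.state + ((x, y),)
                if progress_reward(child_state) > prune_below:
                    self.children.append(Node(child_state, self))

    def best_child(self, c: float) -> "Node":
        """UCT: mean value plus a bonus for rarely visited children."""
        def uct(n: "Node") -> float:
            if n.visits == 0:
                return math.inf
            bonus = c * math.sqrt(math.log(self.visits) / n.visits)
            return n.value / n.visits + bonus
        return max(self.children, key=uct)


def search(iterations: int = 300, c: float = 1.4,
           prune_below: float = 0.0, seed: int = 0) -> Placement:
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend while the node has been expanded and has children.
        while node.expanded and node.children:
            node = node.best_child(c)
        # Expansion: grow the tree by one level, pruning rejected placements.
        if not node.is_terminal() and not node.expanded:
            node.expand(prune_below)
            if node.children:
                node = rng.choice(node.children)
        # Rollout: finish the layout with random placements and score it.
        state = node.state
        while len(state) < N_OBJECTS:
            state += ((rng.randrange(GRID), rng.randrange(GRID)),)
        reward = progress_reward(state)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the layout by following the most visited child at each level.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return node.state


result = search()
```

Because every branch keeps its own visit and value statistics, a poor early placement lowers that branch's score and the search shifts effort to siblings, instead of being locked into the first choice as a purely sequential generator would be.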

In practice

Use emoji grids to provide visual spatial context to LVLMs for layout reasoning.
Integrate diffusion models for consistent object texture generation.
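
As a rough illustration of the emoji-grid idea, the sketch below rasterizes rectangular object footprints onto a top-down grid and returns a string that could be included in an LVLM prompt. The function name, the footprint format, and the choice of emoji are assumptions made for this example; the paper's actual grid encoding may differ. Coordinates are assumed to be non-negative grid cells.

```python
from typing import List, Tuple

EMPTY = "⬜"  # free floor cell

# (emoji, x, y, width, depth) in grid cells; origin at the top-left corner
Footprint = Tuple[str, int, int, int, int]


def render_emoji_grid(cols: int, rows: int, objects: List[Footprint]) -> str:
    """Draw each footprint onto a top-down grid; later objects overwrite earlier ones."""
    grid = [[EMPTY] * cols for _ in range(rows)]
    for emoji, x, y, w, d in objects:
        # Clip each footprint to the room so out-of-bounds parts are ignored.
        for row in range(y, min(y + d, rows)):
            for col in range(x, min(x + w, cols)):
                grid[row][col] = emoji
    return "\n".join("".join(line) for line in grid)


# A 4x3 room with a 2x2 bed (blue) and a 1x1 desk (brown).
layout = render_emoji_grid(4, 3, [("🟦", 0, 0, 2, 2), ("🟫", 3, 2, 1, 1)])
print(layout)
```

A text grid like this gives the model an explicit picture of occupied and free cells, so overlaps and blocked areas are visible without parsing raw coordinates.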

Topics

Text-to-3D Generation
Vision-Language Models
Monte Carlo Tree Search
3D Scene Synthesis
Hierarchical Scene Representation

Code references

dw-dengwei/TreeSearchGen

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.