Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

2026-06-03 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

Brick-Composer is a learning framework that equips multimodal large language models (MLLMs) to assemble objects from real-world bricks. It targets a known weakness of MLLMs: the visual grounding and spatial reasoning needed to build objects from reusable building blocks. The researchers formulate brick assembly as a sequential decision-making problem in which each step involves brick selection and pose estimation. To measure progress, they introduce BC-Bench, described as the first benchmark for diverse brick assembly. Initial experiments show that leading MLLMs perform poorly, achieving less than 1% strict step-level assembly success. Brick-Composer narrows this gap by combining three complementary learning signals: Human Design Sparks, which supply construction demonstrations; World Feedback, which grounds actions in their physical consequences; and Synthetic Experience, which scales learning. The framework more than triples brick selection accuracy and substantially reduces pose estimation error, raising step-level assembly success to approximately 15%. For instance, a trained Qwen-3-8B model can correctly compose up to 42% of the steps for a complete object.
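To make "strict step-level assembly success" concrete, here is a minimal Python sketch of how such a metric could be computed. It is an illustration under stated assumptions, not BC-Bench's actual definition, which this summary does not give: the Step fields, the discrete grid pose, and the exact-match rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One assembly step (hypothetical schema, not taken from BC-Bench)."""
    brick_id: str                   # which brick is selected
    position: Tuple[int, int, int]  # assumed discrete grid coordinates
    yaw_deg: int                    # assumed rotation about the vertical axis


def step_is_correct(pred: Step, gold: Step) -> bool:
    """Strict criterion: the brick AND the full pose must both match."""
    same_brick = pred.brick_id == gold.brick_id
    same_pose = (
        pred.position == gold.position
        and pred.yaw_deg % 360 == gold.yaw_deg % 360
    )
    return same_brick and same_pose


def strict_step_success(preds: Sequence[Step], golds: Sequence[Step]) -> float:
    """Fraction of steps that satisfy the strict criterion."""
    if len(preds) != len(golds):
        raise ValueError("predictions and references must align step by step")
    if not golds:
        return 0.0
    hits = sum(step_is_correct(p, g) for p, g in zip(preds, golds))
    return hits / len(golds)
```

Under this reading, a step with the right brick but a slightly wrong pose counts as a failure, which is one plausible reason strict success rates can sit far below brick selection accuracy.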

Key takeaway

For machine learning engineers building AI agents for physical construction or manipulation, this research indicates that current MLLMs need specialized, grounded training. Combine learning signals such as human demonstrations and physical feedback to address weak spatial reasoning and imprecise pose estimation, and consider synthetic experience generation to scale training data. In the reported experiments this raised step-level assembly success from under 1% to approximately 15%, a large gain, though most steps still fail the strict criterion.

Key insights

MLLMs can acquire physical assembly skills through targeted, physically grounded learning frameworks such as Brick-Composer.

Principles

MLLMs lack inherent fine-grained visual and spatial reasoning for assembly.
Complementary learning signals (demonstrations, physical feedback, synthetic data) are crucial for physical task acquisition.
Grounding actions in real-world feedback improves MLLM performance.

Method

Brick-Composer trains MLLMs for sequential brick assembly using three signals: Human Design Sparks for construction demonstrations, World Feedback for physical grounding, and Synthetic Experience to scale learning beyond existing designs.
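One way to picture World Feedback is a simulator that accepts or rejects each proposed placement and reports why. The Python sketch below is a toy illustration only. The brick catalogue, the grid world, the place function, and the feedback strings are all assumptions made for this example and do not describe the paper's environment.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int, int]

# Hypothetical brick catalogue: name -> (width, depth) footprint; height is 1.
BRICKS: Dict[str, Tuple[int, int]] = {"1x1": (1, 1), "2x1": (2, 1), "2x2": (2, 2)}


def cells_for(brick: str, origin: Cell) -> Set[Cell]:
    """Grid cells a brick would occupy when its corner sits at `origin`."""
    width, depth = BRICKS[brick]
    x, y, z = origin
    return {(x + dx, y + dy, z) for dx in range(width) for dy in range(depth)}


def place(occupied: Set[Cell], brick: str, origin: Cell) -> Tuple[bool, str]:
    """Try one placement; return (accepted, feedback) and update `occupied`."""
    if brick not in BRICKS:
        return False, "unknown brick"
    cells = cells_for(brick, origin)
    if origin[2] < 0:
        return False, "below ground"
    if cells & occupied:
        return False, "collision"
    # A brick is supported if it rests on the ground or on any occupied cell.
    on_ground = origin[2] == 0
    on_brick = any((x, y, z - 1) in occupied for x, y, z in cells)
    if not (on_ground or on_brick):
        return False, "unsupported"
    occupied.update(cells)
    return True, "ok"
```

In a training loop, the returned feedback string could be passed back to the model as text or converted into a reward. Which of these, if either, Brick-Composer uses is not stated in this summary.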

In practice

Integrate human demonstrations for complex physical tasks.
Use real-world feedback loops to refine MLLM actions.
Generate synthetic data to expand MLLM training beyond limited designs.

Topics

Multimodal LLMs
Robotic Assembly
Visual Grounding
Pose Estimation
BC-Bench Benchmark
Synthetic Data

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.