Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

2026-05-26 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

Brick-Composer is a learning framework for improving multimodal large language models (MLLMs) on complex brick assembly tasks. It targets two current MLLM weaknesses, fine-grained brick selection and precise pose estimation, with three complementary learning signals: Human Design Sparks, World Feedback, and Synthetic Experience. To evaluate MLLM capabilities, the researchers also developed BC-Bench, the first benchmark for assembly with diverse bricks, which formulates assembly as a sequential decision-making problem involving brick selection and pose estimation. In the reported experiments, Brick-Composer more than triples brick selection accuracy (from roughly 23% to around 70%) and raises strict step-level assembly success from less than 1% to approximately 15%, with a Qwen-3-8B model achieving up to 42% success for complete objects.
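The gap between selection accuracy and strict step-level success is easier to see with a toy calculation. The sketch below assumes each step is a (brick, pose) pair on a discrete grid and that a step counts as strictly successful only when both match the reference exactly. The `Step` type, the pose encoding, and both metric functions are illustrative assumptions, not BC-Bench's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One assembly decision: which brick, and where it goes (assumed encoding)."""
    brick_id: str   # hypothetical brick type label, e.g. "2x4"
    pose: tuple     # assumed (x, y, z, rotation_deg) on a discrete grid


def selection_accuracy(pred, gold):
    """Fraction of steps where the predicted brick type matches the reference."""
    hits = sum(p.brick_id == g.brick_id for p, g in zip(pred, gold))
    return hits / len(gold)


def strict_step_success(pred, gold):
    """Fraction of steps where brick type AND pose both match exactly."""
    hits = sum(p == g for p, g in zip(pred, gold))
    return hits / len(gold)


# Made-up reference and prediction sequences, for illustration only.
gold = [
    Step("2x4", (0, 0, 0, 0)),
    Step("2x2", (0, 0, 1, 0)),
    Step("1x2", (2, 0, 1, 90)),
    Step("2x4", (0, 0, 2, 0)),
]
pred = [
    Step("2x4", (0, 0, 0, 0)),    # right brick, right pose
    Step("2x2", (1, 0, 1, 0)),    # right brick, wrong position
    Step("1x1", (2, 0, 1, 90)),   # wrong brick
    Step("2x4", (0, 0, 2, 90)),   # right brick, wrong rotation
]

sel = selection_accuracy(pred, gold)      # 0.75
strict = strict_step_success(pred, gold)  # 0.25
```

Under this assumed definition, strict success can never exceed selection accuracy, which is consistent with the strict figure in the summary sitting well below the selection figure.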

Key takeaway

For AI scientists and robotics engineers developing autonomous assembly agents, the practical lesson is to train with physically grounded learning signals, specifically simulator-based world feedback and scalable synthetic data generation, to address MLLM limitations in fine-grained object manipulation and precise pose estimation. The reported results suggest this can substantially improve assembly success rates, moving MLLMs from instruction interpretation toward reliable execution in complex construction tasks.

Key insights

MLLMs can acquire complex brick assembly skills through targeted, physically grounded learning.

Principles

Assembly learning benefits from design-driven structures.
Simulator-based feedback aids error discovery and recovery.
Scalable compositional diversity expands learning beyond existing designs.

Method

Brick-Composer equips MLLMs with assembly skills using Human Design Sparks for demonstrations, World Feedback for grounding actions in consequences, and Synthetic Experience for scalable, physically plausible configurations.
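One way to picture World Feedback is a propose, check, retry loop in which the environment's error messages are handed back to the model. The following is a minimal sketch under strong assumptions: `ToyWorld` is a hypothetical stand-in for a physics simulator (unit-cube bricks on a grid, with only collision and support checks), and `scripted_policy` is a fixed stub standing in for an MLLM. None of these names come from the paper or its repository.

```python
from typing import Callable, Optional

Cell = tuple[int, int, int]  # (x, y, z) grid cell occupied by a unit brick


class ToyWorld:
    """Hypothetical simulator stand-in: tracks occupied cells and validates placements."""

    def __init__(self) -> None:
        self.occupied: set[Cell] = set()

    def check(self, cell: Cell) -> Optional[str]:
        """Return None if the placement is valid, otherwise a textual error message."""
        x, y, z = cell
        if cell in self.occupied:
            return f"collision at {cell}"
        if z > 0 and (x, y, z - 1) not in self.occupied:
            return f"no support below {cell}"
        return None

    def place(self, cell: Cell) -> None:
        self.occupied.add(cell)


def act_with_feedback(world: ToyWorld,
                      propose: Callable[[list[str]], Cell],
                      max_tries: int = 3):
    """Ask the policy for a placement, returning simulator errors to it on failure."""
    feedback: list[str] = []
    for _ in range(max_tries):
        cell = propose(feedback)       # the policy sees all errors so far
        error = world.check(cell)
        if error is None:
            world.place(cell)
            return cell, feedback
        feedback.append(error)
    return None, feedback              # gave up after max_tries


def scripted_policy(feedback: list[str]) -> Cell:
    """Stub for an MLLM: tries a fixed list of candidates, one per failed attempt."""
    candidates = [(0, 0, 0), (1, 0, 1), (0, 0, 1)]
    return candidates[len(feedback)]


world = ToyWorld()
world.place((0, 0, 0))                 # one brick already on the ground
placed, errors = act_with_feedback(world, scripted_policy)
# placed == (0, 0, 1)
# errors == ["collision at (0, 0, 0)", "no support below (1, 0, 1)"]
```

In a real setup the error strings would be folded into the model's next prompt or used as a training signal. How Brick-Composer actually consumes simulator feedback is not specified in this summary.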

In practice

Integrate simulator feedback for MLLM error correction.
Generate synthetic data to scale spatial reasoning training.
Use human design data for affordance-rich demonstrations.
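For the synthetic data point above, a common pattern is rejection sampling: draw random placements and keep only those that pass basic plausibility checks. The sketch below is an assumption-laden illustration, not the paper's generator. The brick catalogue `BRICKS`, the grid size, and the collision and support rules are all invented for the example.

```python
import random

# Illustrative brick footprints as (width, depth) in grid cells; not from the paper.
BRICKS = {"1x1": (1, 1), "1x2": (1, 2), "2x2": (2, 2), "2x4": (2, 4)}


def footprint(name, x, y, rot90):
    """Set of (x, y) cells a brick covers, optionally rotated by 90 degrees."""
    w, d = BRICKS[name]
    if rot90:
        w, d = d, w
    return {(x + i, y + j) for i in range(w) for j in range(d)}


def sample_assembly(n_steps, grid=6, max_height=3, seed=0, max_attempts=1000):
    """Rejection-sample up to n_steps placements that fit, do not collide, and are supported."""
    rng = random.Random(seed)          # seeded for reproducibility
    occupied = set()                   # (x, y, z) cells already filled
    steps = []
    for _ in range(max_attempts):
        if len(steps) == n_steps:
            break
        name = rng.choice(sorted(BRICKS))
        rot = rng.random() < 0.5
        x, y, z = rng.randrange(grid), rng.randrange(grid), rng.randrange(max_height)
        cells = footprint(name, x, y, rot)
        if any(cx >= grid or cy >= grid for cx, cy in cells):
            continue                   # brick hangs off the baseplate
        if any((cx, cy, z) in occupied for cx, cy in cells):
            continue                   # collides with an existing brick
        if z > 0 and not any((cx, cy, z - 1) in occupied for cx, cy in cells):
            continue                   # nothing underneath to attach to
        occupied |= {(cx, cy, z) for cx, cy in cells}
        steps.append({"brick": name, "x": x, "y": y, "z": z, "rot90": rot})
    return steps


assembly = sample_assembly(5, seed=0)
```

Each accepted step could then be rendered and paired with its (brick, pose) label as a training example. A real generator would need proper stud connectivity and stability checks, which this sketch deliberately omits.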

Topics

Multimodal Large Language Models
Robotic Assembly
Spatial Reasoning
Brick-Composer
BC-Bench
Pose Estimation
Synthetic Data Generation

Code references

Lumos-Jiateng/Brick-Composer

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.