Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
Summary
Brick-Composer is a learning framework designed to equip multimodal large language models (MLLMs) with the ability to assemble real-world objects from reusable building blocks. The task exposes two weaknesses of current MLLMs: visual grounding and spatial reasoning. The researchers formulate brick assembly as a sequential decision-making problem involving brick selection and pose estimation at each step. To evaluate progress, they introduce BC-Bench, described as the first benchmark for diverse brick assembly. Initial experiments show that leading MLLMs perform poorly, achieving less than 1% strict step-level assembly success. Brick-Composer addresses this gap by combining three complementary learning signals: Human Design Sparks (construction demonstrations), World Feedback (grounding actions in their physical consequences), and Synthetic Experience (scalable learning). The framework improves brick selection accuracy more than threefold and substantially reduces pose estimation error, raising step-level assembly success to approximately 15%. For instance, a trained Qwen-3-8B model can correctly compose up to 42% of the steps for a complete object.
Key takeaway
For machine learning engineers developing AI agents for physical construction or manipulation, this research indicates that current MLLMs need specialized, grounded learning frameworks. Combine multimodal signals such as human demonstrations and real-world feedback to address limitations in spatial reasoning and precise pose estimation, and consider synthetic experience generation to scale training data. In the reported experiments, this combination raised strict step-level assembly success from under 1% to roughly 15%.
Key insights
MLLMs can acquire complex physical assembly skills through targeted, multimodal, physically grounded learning frameworks such as Brick-Composer.
Principles
- MLLMs lack inherent fine-grained visual and spatial reasoning for assembly.
- Multimodal learning signals are crucial for physical task acquisition.
- Grounding actions in real-world feedback improves MLLM performance.
Method
Brick-Composer trains MLLMs for sequential brick assembly using Human Design Sparks, World Feedback for physical grounding, and Synthetic Experience to scale learning beyond existing designs.
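To make the step-level framing concrete, here is a minimal sketch of how one assembly step and a strict step-level success check might be represented. Everything in it is an assumption for illustration: the `Step` schema, the `evaluate` helper, the pose encoding (a 3D position plus a yaw angle), and the zero-tolerance defaults are hypothetical and are not taken from BC-Bench or the Brick-Composer code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """One assembly action (hypothetical schema): which brick, and its pose."""
    brick_id: str
    position: Tuple[float, float, float]
    yaw_deg: float


def position_error(a: Step, b: Step) -> float:
    """Euclidean distance between the two brick positions."""
    return sum((p - q) ** 2 for p, q in zip(a.position, b.position)) ** 0.5


def yaw_error(a: Step, b: Step) -> float:
    """Smallest absolute angle between the two yaws, in degrees (0 to 180)."""
    diff = abs(a.yaw_deg - b.yaw_deg) % 360.0
    return min(diff, 360.0 - diff)


def step_success(pred: Step, target: Step,
                 pos_tol: float = 0.0, yaw_tol: float = 0.0) -> bool:
    """Assumed 'strict' rule: right brick AND pose within both tolerances."""
    return (pred.brick_id == target.brick_id
            and position_error(pred, target) <= pos_tol
            and yaw_error(pred, target) <= yaw_tol)


def evaluate(preds: List[Step], targets: List[Step],
             pos_tol: float = 0.0, yaw_tol: float = 0.0) -> Dict[str, float]:
    """Return brick selection accuracy and strict step-level success rate."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal length")
    n = len(targets)
    selection_hits = sum(p.brick_id == t.brick_id for p, t in zip(preds, targets))
    success_hits = sum(step_success(p, t, pos_tol, yaw_tol)
                       for p, t in zip(preds, targets))
    return {
        "selection_accuracy": selection_hits / n,
        "step_success_rate": success_hits / n,
    }


# Toy sequence: step 1 fully correct, step 2 right brick but wrong position,
# step 3 wrong brick.
targets = [
    Step("2x4", (0.0, 0.0, 0.0), 0.0),
    Step("2x2", (2.0, 0.0, 1.0), 90.0),
    Step("1x2", (0.0, 1.0, 1.0), 0.0),
]
preds = [
    Step("2x4", (0.0, 0.0, 0.0), 360.0),
    Step("2x2", (3.0, 0.0, 1.0), 90.0),
    Step("1x1", (0.0, 1.0, 1.0), 0.0),
]
metrics = evaluate(preds, targets)
```

Under this assumed rule, a step counts only when both the selected brick and its pose are right, so step-level success can never exceed selection accuracy. That is consistent with the summary's picture of a model that picks bricks more reliably than it places them.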
In practice
- Integrate human demonstrations for complex physical tasks.
- Use real-world feedback loops to refine MLLM actions.
- Generate synthetic data to expand MLLM training beyond limited designs.
Topics
- Multimodal LLMs
- Robotic Assembly
- Visual Grounding
- Pose Estimation
- BC-Bench Benchmark
- Synthetic Data
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.