Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
Summary
Brick-Composer is a learning framework designed to improve multimodal large language models (MLLMs) on complex brick assembly tasks. To address current MLLM weaknesses in fine-grained brick selection and precise pose estimation, the framework introduces three complementary learning signals: Human Design Sparks, World Feedback, and Synthetic Experience. To evaluate MLLM capabilities, the researchers also developed BC-Bench, the first benchmark for assembly with diverse bricks, which formulates assembly as a sequential decision-making problem involving brick selection and pose estimation. In experiments, Brick-Composer roughly triples brick selection accuracy (from about 23% to about 70%) and raises strict step-level assembly success from less than 1% to approximately 15%, with a Qwen-3-8B model achieving up to 42% success for complete objects.
Key takeaway
For AI scientists and robotics engineers developing autonomous assembly agents, this research points to physically grounded learning signals as the path forward. Integrate simulator-based world feedback and scalable synthetic data generation to overcome MLLM limitations in fine-grained object manipulation and precise pose estimation. This approach can substantially improve assembly success rates, moving MLLMs from instruction interpretation toward reliable execution in complex construction tasks.
Key insights
MLLMs can acquire complex brick assembly skills through targeted, physically grounded learning.
Principles
- Assembly learning benefits from design-driven structures.
- Simulator-based feedback aids error discovery and recovery.
- Scalable compositional diversity expands learning beyond existing designs.
Method
Brick-Composer equips MLLMs with assembly skills using Human Design Sparks for demonstrations, World Feedback for grounding actions in consequences, and Synthetic Experience for scalable, physically plausible configurations.
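To make the sequential decision-making framing concrete, here is a minimal sketch of how one assembly step and two step-level scores might be represented. It is an illustration only: this summary does not give BC-Bench's action format or metric definitions, so the `Step` fields, the grid-based pose encoding, and the `score_steps` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One assembly decision (hypothetical encoding, not BC-Bench's format)."""
    brick_id: str                    # which brick is selected
    position: Tuple[int, int, int]   # assumed grid position (x, y, z)
    rotation: int                    # assumed rotation in degrees about the vertical axis


def score_steps(predicted: List[Step], reference: List[Step]) -> Tuple[float, float]:
    """Return (selection_accuracy, strict_step_success) over aligned steps.

    Selection accuracy counts steps where the chosen brick matches.
    Strict success (an assumed definition) also requires the pose to match exactly.
    """
    if not reference or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and the same length")
    pairs = list(zip(predicted, reference))
    selection_hits = sum(p.brick_id == r.brick_id for p, r in pairs)
    strict_hits = sum(p == r for p, r in pairs)  # dataclass equality compares all fields
    return selection_hits / len(pairs), strict_hits / len(pairs)
```

Under this assumed definition, strict step success can never exceed selection accuracy, because a step with the wrong brick cannot be fully correct. That is consistent with the gap between the selection and strict step-level figures reported in the summary.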
In practice
- Integrate simulator feedback for MLLM error correction.
- Generate synthetic data to scale spatial reasoning training.
- Use human design data for affordance-rich demonstrations.
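The first item above can be sketched as a propose, check, retry loop. This is a toy under stated assumptions, not the paper's pipeline: `check_placement` is a stand-in for a real physics simulator (it only checks collision and support on a unit grid), and `propose` is a hypothetical callable representing the model, which receives the last error message and returns a new placement.

```python
from typing import Callable, Optional, Set, Tuple

Cell = Tuple[int, int, int]  # (x, y, z) on a unit grid; z == 0 is the baseplate


def check_placement(occupied: Set[Cell], cell: Cell) -> Optional[str]:
    """Toy simulator check: return an error message, or None if the placement is valid."""
    x, y, z = cell
    if z < 0:
        return "below the baseplate"
    if cell in occupied:
        return "collision with an existing brick"
    if z > 0 and (x, y, z - 1) not in occupied:
        return "unsupported: nothing directly underneath"
    return None


def place_with_feedback(
    occupied: Set[Cell],
    propose: Callable[[Optional[str]], Cell],
    max_attempts: int = 3,
) -> Optional[Cell]:
    """Ask the (hypothetical) policy for a placement, feeding back errors until it is valid.

    Returns the accepted cell, or None if every attempt was rejected.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        cell = propose(feedback)          # the policy sees the previous error, if any
        feedback = check_placement(occupied, cell)
        if feedback is None:
            occupied.add(cell)            # commit only valid placements
            return cell
    return None
```

Passing the error text back to the policy is one simple way to let it discover and recover from mistakes. A training setup could instead log the rejected and accepted attempts as learning signal, which this sketch does not attempt.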
Topics
- Multimodal Large Language Models
- Robotic Assembly
- Spatial Reasoning
- Brick-Composer
- BC-Bench
- Pose Estimation
- Synthetic Data Generation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.