Co-policy: Responsive Human-Robot Co-Creation for Musical Performances
Summary
The Co-policy framework enables responsive human-robot musical co-creation by separating semantic intent grounding, constrained musical variation, and visuomotor execution. It utilizes pre-inference semantic anchors and a fine-tuned Qwen-vl planner (F-Qwen) to convert speech, live musical seeds, and visual observations into structured co-creation plans. For low-latency execution, Co-policy introduces a Gaussian-Mixture Visuomotor Policy (GMP), a conditional mixture-density policy that maps target notes and visual context to multimodal robot actions in a single forward pass. Unlike robotic playback systems, Co-policy generates complementary musical responses under both musical and physical constraints. Real-robot chime experiments demonstrated improved intent alignment, execution accuracy, and response frequency compared to diffusion-policy and ablated baselines.
Key takeaway
For robotics engineers building embodied AI for creative tasks, Co-policy suggests that separating semantic intent grounding, musical variation, and visuomotor execution is what makes responsive co-creation practical. Consider using distinct modules: a planner such as F-Qwen for high-level planning and a single-pass policy such as GMP for low-latency physical action. In the reported real-robot chime experiments, this design improved intent alignment, execution accuracy, and response frequency over diffusion-policy and ablated baselines.
Key insights
Separating semantic intent, musical variation, and visuomotor execution enables responsive human-robot musical co-creation.
Principles
- Embodied AI can participate in human creativity through physical action.
- Physically grounded action generation is key for embodied human-AI co-creation.
- Separating concerns (semantic, variation, execution) improves co-creation.
Method
Co-policy uses F-Qwen, a fine-tuned Qwen-vl planner, to ground speech, live musical seeds, and visual observations into structured co-creation plans. GMP, a conditional mixture-density policy, then maps target notes and visual context to multimodal robot actions in a single forward pass, which keeps execution latency low.
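To make the idea of a conditional mixture-density policy concrete, here is a minimal sketch in plain Python. It is not the authors' GMP: the single linear layer, the random weights, the one-hot note encoding, and every name (MixtureDensityHead, make_condition) are assumptions made for illustration. It shows only the general technique: one forward pass yields mixture weights, means, and standard deviations, and an action is sampled from that mixture without iterative refinement.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_condition(note_index, n_notes, visual_features):
    """Hypothetical conditioning vector: one-hot target note + visual features."""
    one_hot = [0.0] * n_notes
    one_hot[note_index] = 1.0
    return one_hot + list(visual_features)


class MixtureDensityHead:
    """Toy conditional Gaussian mixture with diagonal covariance.

    A single linear layer stands in for whatever network a real policy
    would use; the weights are random, not trained.
    """

    def __init__(self, cond_dim, action_dim, n_components, seed=0):
        rng = random.Random(seed)
        self.k = n_components
        self.d = action_dim
        # Per component: 1 mixture logit, d means, d log-standard-deviations.
        out_dim = n_components * (1 + 2 * action_dim)
        self.weights = [
            [rng.gauss(0.0, 0.1) for _ in range(cond_dim)] for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def forward(self, cond):
        """One pass: conditioning vector -> (mixture weights, means, stds)."""
        out = [
            sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        k, d = self.k, self.d
        mix = softmax(out[:k])
        means = [out[k + i * d: k + (i + 1) * d] for i in range(k)]
        offset = k + k * d
        # Clamp log-std before exponentiating so stds stay positive and bounded.
        stds = [
            [math.exp(min(max(v, -5.0), 2.0))
             for v in out[offset + i * d: offset + (i + 1) * d]]
            for i in range(k)
        ]
        return mix, means, stds

    def sample(self, cond, rng):
        """Pick a component by its weight, then sample an action from it."""
        mix, means, stds = self.forward(cond)
        idx = rng.choices(range(self.k), weights=mix, k=1)[0]
        return [rng.gauss(mu, sd) for mu, sd in zip(means[idx], stds[idx])]
```

Because sampling needs only one forward pass plus a draw from the chosen Gaussian, a head of this kind avoids the repeated denoising steps of a diffusion policy, which is plausibly the source of the latency advantage the summary describes.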
In practice
- Use F-Qwen for semantic plan generation.
- Implement GMP for low-latency visuomotor control.
- Design systems with separated semantic and execution layers.
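The layering suggested above can be sketched as three swappable stages behind a small shared data type. Everything here is hypothetical: the stub planner stands in for a model like F-Qwen, the stub executor stands in for a policy like GMP, and the eight-chime playable range and the "transpose_up" intent are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical physical constraint: the robot can reach eight chimes (0..7).
PLAYABLE = range(0, 8)


@dataclass(frozen=True)
class CoCreationPlan:
    """Structured output of the semantic layer."""
    intent: str
    seed_notes: List[int]


def stub_planner(speech: str, seed_notes: Sequence[int]) -> CoCreationPlan:
    """Stand-in for a learned planner: keyword matching instead of a model."""
    intent = "transpose_up" if "higher" in speech.lower() else "echo"
    return CoCreationPlan(intent=intent, seed_notes=list(seed_notes))


def constrained_variation(plan: CoCreationPlan) -> List[int]:
    """Vary the seed, then drop notes the robot cannot physically play."""
    shift = 2 if plan.intent == "transpose_up" else 0
    return [n + shift for n in plan.seed_notes if (n + shift) in PLAYABLE]


def stub_executor(note: int) -> str:
    """Stand-in for a visuomotor policy: returns a description, moves nothing."""
    return f"strike chime {note}"


def run_pipeline(
    speech: str,
    seed_notes: Sequence[int],
    planner: Callable[[str, Sequence[int]], CoCreationPlan],
    vary: Callable[[CoCreationPlan], List[int]],
    execute: Callable[[int], str],
) -> List[str]:
    """Semantic grounding -> constrained variation -> execution."""
    plan = planner(speech, seed_notes)
    notes = vary(plan)
    return [execute(n) for n in notes]


result = run_pipeline(
    "Play it higher", [1, 3, 6], stub_planner, constrained_variation, stub_executor
)
print(result)
```

Since each stage only depends on the plan type and a list of notes, the planner or the executor can be replaced or ablated independently, which is the practical benefit of keeping the semantic and execution layers separate.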
Topics
- Human-Robot Co-creation
- Musical Performance
- Embodied AI
- Visuomotor Control
- Qwen-vl Planner
- Gaussian-Mixture Visuomotor Policy
Best for: AI Scientist, Robotics Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.