MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
Summary
MolmoMotion addresses goal-conditioned 3D point motion forecasting: given a visual history and a goal described in language, it predicts the future 3D trajectories of points on an object. Motion is represented as 3D points in world coordinates, a representation that is class-agnostic, view-stable, and compact. The project has three parts: MolmoMotion-1M, a dataset of 1.16M action-described 3D point trajectories; PointMotionBench, a human-verified benchmark covering 111 object categories and 61 motion types; and the MolmoMotion model, which supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. The model is reported to significantly outperform existing baselines, and its learned 3D motion prior is reported to improve training efficiency for robot manipulation and to provide effective motion guidance for generative video models.
Key takeaway
For machine learning engineers building autonomous agents or generative video models, MolmoMotion offers a framework for anticipating object movement. Consider goal-conditioned 3D point motion forecasting where planning depends on how objects will move, or where generated objects need to move plausibly. The learned motion prior is reported to improve training efficiency for robot manipulation and to provide motion guidance for generative models.
Key insights
3D point trajectories conditioned on language enable robust motion forecasting and transfer to downstream tasks.
Principles
- 3D points offer class-agnostic, view-stable motion representation.
- Language instructions guide diverse motion pattern prediction.
- Learned 3D motion priors improve robot manipulation.
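The view-stability principle above can be illustrated with a small sketch. It assumes a standard pinhole-style extrinsic convention (p_cam = R * p_world + t); the camera poses, point values, and function names are hypothetical and are not taken from the MolmoMotion code. The same physical point has different camera-frame coordinates in two views but maps back to a single world-frame coordinate.

```python
# Hypothetical illustration: one world-space point observed by two cameras.

def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def world_to_camera(p_world, rotation, translation):
    """Assumed convention: p_cam = R * p_world + t."""
    rotated = mat_vec(rotation, p_world)
    return [rotated[i] + translation[i] for i in range(3)]

def camera_to_world(p_cam, rotation, translation):
    """Inverse mapping: p_world = R^T * (p_cam - t), valid for orthonormal R."""
    shifted = [p_cam[i] - translation[i] for i in range(3)]
    return mat_vec(transpose(rotation), shifted)

# Camera A sits at the world origin with identity rotation.
R_A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_A = [0.0, 0.0, 0.0]

# Camera B is rotated 90 degrees about the z-axis and shifted.
R_B = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T_B = [0.5, 0.0, 2.0]

point_world = [1.0, 2.0, 3.0]

cam_a = world_to_camera(point_world, R_A, T_A)
cam_b = world_to_camera(point_world, R_B, T_B)

# Camera-frame coordinates differ per view; world coordinates agree.
back_a = camera_to_world(cam_a, R_A, T_A)
back_b = camera_to_world(cam_b, R_B, T_B)
print("camera A:", cam_a, "camera B:", cam_b)
print("world from A:", back_a, "world from B:", back_b)
```

A trajectory stored in world coordinates therefore does not change when the camera moves, which is one reason such a representation is convenient for forecasting.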
Method
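MolmoMotion forecasts 3D point trajectories with a full stack: MolmoMotion-1M, a corpus of 1.16M action-described 3D point trajectories; PointMotionBench, a benchmark covering 111 object categories and 61 motion types; and a model that supports either autoregressive coordinate prediction or flow-matching generation.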
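To give a feel for the flow-matching option, here is a minimal sketch of how a trajectory could be sampled by integrating a velocity field from noise with Euler steps. It is not the MolmoMotion implementation: `toy_velocity` is a hypothetical stand-in for a learned network, and it is handed the target directly, whereas a real model would predict velocity from the video history and the language goal.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For straight-line paths x_t = (1 - t) * x0 + t * x1, the conditional
    velocity toward a known endpoint x1 is (x1 - x_t) / (1 - t).
    """
    return [(target[i] - x[i]) / (1.0 - t) for i in range(len(x))]

def sample_trajectory(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [x[i] + dt * v[i] for i in range(len(x))]
    return x

# A flattened future trajectory for one point: 3 time steps of (x, y, z).
target_trajectory = [0.1, 0.0, 0.5,
                     0.2, 0.0, 0.5,
                     0.3, 0.1, 0.5]

sampled = sample_trajectory(target_trajectory)
print([round(value, 6) for value in sampled])
```

With this toy velocity field the integration lands on the target trajectory up to floating-point error. The autoregressive alternative would instead emit coordinates one step at a time, each conditioned on the coordinates already produced.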
In practice
- Use 3D point representations for general object motion.
- Integrate language goals for nuanced trajectory prediction.
- Apply 3D motion priors to enhance robot control.
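The practices above can be sketched as a small data container that pairs a language goal with observed and future 3D point positions. The class name, field names, and error function are assumptions made for illustration; they do not reflect the MolmoMotion-1M schema, and the error is a generic average Euclidean distance, not necessarily the metric that PointMotionBench reports.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class ForecastSample:
    """Hypothetical sample layout: frames are indexed [time][point]."""
    goal: str                      # language-described goal
    history: List[List[Point3D]]   # observed world-space points per frame
    future: List[List[Point3D]]    # ground-truth future points per frame

    def num_points(self) -> int:
        return len(self.history[0])

    def is_consistent(self) -> bool:
        """Every frame must track the same number of points."""
        n = self.num_points()
        return all(len(frame) == n for frame in self.history + self.future)

def mean_point_error(pred: List[List[Point3D]], truth: List[List[Point3D]]) -> float:
    """Average Euclidean distance over all future frames and points."""
    distances = [
        math.dist(p, q)
        for pred_frame, true_frame in zip(pred, truth)
        for p, q in zip(pred_frame, true_frame)
    ]
    return sum(distances) / len(distances)

sample = ForecastSample(
    goal="push the mug to the right",
    history=[[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
             [(0.1, 0.0, 1.0), (0.2, 0.0, 1.0)]],
    future=[[(0.2, 0.0, 1.0), (0.3, 0.0, 1.0)],
            [(0.3, 0.0, 1.0), (0.4, 0.0, 1.0)]],
)

# A deliberately wrong prediction: every future point raised by 1.0 along z.
prediction = [[(x, y, z + 1.0) for (x, y, z) in frame] for frame in sample.future]
error = mean_point_error(prediction, sample.future)
print(sample.is_consistent(), error)
```

Keeping points in a class-agnostic container like this means the same structure serves any object category; only the goal text and the coordinates change.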
Topics
- 3D Motion Forecasting
- Point Trajectories
- Language Instruction
- Robot Manipulation
- Generative Video Models
- MolmoMotion
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.