LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Summary
LLaMo is a unified framework from researchers at Brown University and Meta that extends pretrained Large Language Models (LLMs) to both human motion understanding and generation. It addresses key challenges in existing approaches, including catastrophic forgetting of linguistic capabilities caused by limited motion-text data and jitter artifacts introduced by discrete motion tokenization. LLaMo uses a modality-specific Mixture-of-Transformers (MoT) architecture, which preserves the base LLM's language understanding while enabling scalable multimodal adaptation. It encodes human motion into a causal continuous latent space and uses a lightweight flow-matching head for autoregressive, real-time streaming motion generation at 30 FPS or higher. The model was pretrained on a new in-house dataset of over 3 million motion sequences (3,076 hours). It demonstrates high-fidelity text-to-motion generation and motion-to-text captioning, including strong zero-shot performance, without compromising the LLM's original text-only capabilities.
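To make the flow-matching head more concrete, here is a minimal Python sketch of the generic conditional flow-matching recipe: during training, regress a velocity along a straight path from noise to data; at sampling time, integrate that velocity with a few Euler steps. This is not LLaMo's implementation. The names `velocity_fn` and `oracle_velocity`, the straight-line path, and the step count are assumptions made for illustration, and `cond` merely stands in for whatever conditioning the LLM backbone would provide.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocity on a straight path."""
    x0 = rng.standard_normal(x1.shape)          # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))      # one random time in [0, 1) per sample
    xt = (1.0 - t) * x0 + t * x1                # point on the line from x0 to x1
    target_v = x1 - x0                          # constant velocity of that line
    return float(np.mean((velocity_fn(xt, t, cond) - target_v) ** 2))

def sample_latent(velocity_fn, cond, dim, steps, rng):
    """Draw noise and follow the velocity field from t=0 to t=1 with Euler steps."""
    x = rng.standard_normal((cond.shape[0], dim))
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((cond.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x

def oracle_velocity(xt, t, cond):
    """Hypothetical stand-in for a trained head: treats `cond` as the target latent."""
    return (cond - xt) / (1.0 - t)

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 8))            # toy "next motion latent" per sample
loss = flow_matching_loss(oracle_velocity, target, target, rng)
latent = sample_latent(oracle_velocity, target, dim=8, steps=4, rng=rng)
```

In this toy setup the oracle velocity is exact, so the loss is essentially zero and a handful of Euler steps lands on the target latent. A learned head would only approximate this, and the number of steps trades sample quality against per-frame latency.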
Key takeaway
For research scientists developing multimodal AI, LLaMo demonstrates a way to add human motion capabilities to an existing LLM without degrading its core language performance. Consider modality-specific architectures and continuous latent spaces when adding a new modality. They help avoid catastrophic forgetting and quantization artifacts, especially when aiming for real-time, high-fidelity generation in resource-constrained environments.
Key insights
LLaMo unifies motion-language understanding and generation in LLMs using continuous tokens and an MoT architecture.
Principles
- Preserve base LLM language competence.
- Avoid discrete tokenization for continuous data.
- Enable real-time streaming generation.
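The second principle can be illustrated with a toy experiment. The Python sketch below snaps a smooth one-dimensional trajectory to a small set of evenly spaced codes and compares a crude jitter score before and after. The uniform scalar quantizer, the sine trajectory, and the `jitter` metric are all assumptions chosen for simplicity; they are not the tokenizers or metrics used in the paper.

```python
import numpy as np

def quantize(x, levels, lo=-1.0, hi=1.0):
    """Snap each value to the nearest of `levels` evenly spaced codes in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    return lo + np.round((x - lo) / step) * step

def jitter(x):
    """Mean absolute second difference: a crude frame-to-frame jitter score."""
    return float(np.mean(np.abs(np.diff(x, n=2))))

t = np.linspace(0.0, 1.0, 120)          # 120 frames of a toy joint trajectory
smooth = np.sin(2 * np.pi * t)          # continuous-valued motion signal
tokens = quantize(smooth, levels=16)    # same signal after discrete tokenization

smooth_jitter = jitter(smooth)
token_jitter = jitter(tokens)
```

The quantized signal stays within half a code step of the original, yet its frame-to-frame acceleration is far less smooth, because the value holds still and then jumps by a full step. A continuous latent representation avoids this staircase effect by construction.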
Method
LLaMo is built on a decoder-only Transformer backbone. It uses an MoT architecture to keep motion and language parameters separate, a causal continuous latent space to represent motion, and a flow-matching head for autoregressive generation.
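The idea of separating motion and language parameters can be sketched in a few lines of Python. In this simplified block, all tokens share one causal self-attention step, and each token is then routed to a feed-forward network chosen by its modality. Which parameters LLaMo actually duplicates per modality is not stated here, so the choice to split only the feed-forward weights, along with every name in the code (`mot_block`, `init_ffn`, `TEXT`, `MOTION`), is an assumption for illustration.

```python
import numpy as np

TEXT, MOTION = 0, 1  # hypothetical modality ids

def init_ffn(rng, d_model, d_hidden):
    """Random two-layer feed-forward weights for one modality."""
    return {
        "w1": rng.normal(0.0, 0.1, (d_model, d_hidden)),
        "w2": rng.normal(0.0, 0.1, (d_hidden, d_model)),
    }

def causal_self_attention(x):
    """Single-head causal attention without learned projections, for brevity."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # positions after the query
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def mot_block(x, modality, params):
    """Shared attention over the sequence, then a modality-specific feed-forward."""
    h = x + causal_self_attention(x)
    out = h.copy()
    for m, ffn in params.items():
        idx = modality == m                               # tokens of this modality
        if idx.any():
            out[idx] = h[idx] + np.maximum(h[idx] @ ffn["w1"], 0.0) @ ffn["w2"]
    return out

rng = np.random.default_rng(0)
params = {TEXT: init_ffn(rng, 8, 16), MOTION: init_ffn(rng, 8, 16)}
x = rng.normal(size=(5, 8))
modality = np.array([TEXT, TEXT, TEXT, MOTION, MOTION])
y = mot_block(x, modality, params)

# Swap in different motion weights, as if only the motion branch had been trained.
retrained = {TEXT: params[TEXT], MOTION: init_ffn(np.random.default_rng(1), 8, 16)}
y_retrained = mot_block(x, modality, retrained)
```

Changing the motion weights alters only the motion-token outputs. The text tokens, which here precede the motion tokens, pass through exactly the same computation as before, which is the intuition behind preserving the base LLM's language ability while training new motion parameters.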
In practice
- Integrate motion generation into LLMs.
- Develop real-time motion synthesis systems.
- Improve zero-shot motion generation.
Topics
- LLaMo
- Motion-Language Models
- Continuous Motion Generation
- Mixture-of-Transformers
- Unified Multimodal AI
Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.