LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

2026-02-16 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

LLaMo is a unified framework from researchers at Brown University and Meta that extends pretrained Large Language Models (LLMs) to both human motion understanding and generation. It targets two problems in existing approaches: catastrophic forgetting of linguistic capabilities caused by limited motion-text data, and jitter artifacts introduced by discrete motion tokenization. LLaMo uses a modality-specific Mixture-of-Transformers (MoT) architecture that preserves the base LLM's language understanding while enabling scalable multimodal adaptation. Human motion is encoded into a causal continuous latent space, and a lightweight flow-matching head generates it as a real-time stream at 30 FPS or more. The model was pretrained on a new in-house dataset of over 3 million motion sequences (3,076 hours). It demonstrates high-fidelity text-to-motion generation and motion-to-text captioning, including strong zero-shot performance, without compromising the LLM's original text-only capabilities.
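The modality-specific routing behind a Mixture-of-Transformers layer can be illustrated with a minimal NumPy sketch. It assumes, for illustration only, that just the feed-forward weights are split by modality; the summary does not say which parameters LLaMo separates, and `make_ffn`, `mot_ffn`, and the width `D` are hypothetical names rather than the authors' code.

```python
import numpy as np

D = 8  # hidden width (illustrative value, not from the paper)

def make_ffn(rng, d):
    """Create one small feed-forward block: two weight matrices."""
    return {"w1": rng.normal(size=(d, 4 * d)) * 0.1,
            "w2": rng.normal(size=(4 * d, d)) * 0.1}

def ffn(params, x):
    """Two-layer feed-forward block with a ReLU in between."""
    return np.maximum(x @ params["w1"], 0.0) @ params["w2"]

def mot_ffn(experts, x, modality):
    """Route each token to the feed-forward weights of its own modality.

    x: (T, D) hidden states; modality: (T,) array with 0 = text, 1 = motion.
    """
    out = np.empty_like(x)
    for m, params in experts.items():
        mask = modality == m
        out[mask] = ffn(params, x[mask])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, D))
modality = np.array([0, 0, 1, 1, 0])          # mixed text/motion sequence
experts = {0: make_ffn(rng, D), 1: make_ffn(rng, D)}

before = mot_ffn(experts, x, modality)
experts[1]["w1"] += 1.0                        # stand-in for a motion-only training update
after = mot_ffn(experts, x, modality)
```

Because text tokens only ever pass through the text weights, the update to the motion weights leaves their outputs in this layer untouched, which is the intuition behind preserving language competence. In a full model, attention can still mix information across modalities.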

Key takeaway

For research scientists developing multimodal AI, LLaMo demonstrates a way to add human motion capabilities to an existing LLM without degrading its core language performance. Consider modality-specific architectures and continuous latent spaces when adding a new modality: the former guards against catastrophic forgetting and the latter avoids quantization artifacts, which matters most for real-time, high-fidelity generation in resource-constrained environments.

Key insights

LLaMo unifies motion-language understanding and generation in LLMs using continuous tokens and a MoT architecture.

Principles

Preserve base LLM language competence.
Avoid discrete tokenization for continuous data.
Enable real-time streaming generation.
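The second principle can be made concrete with a toy example. The sketch below snaps a smooth, hypothetical joint-angle curve to a small scalar codebook and compares frame-to-frame acceleration before and after. It is an assumption-laden illustration of quantization error in general, not a reproduction of any motion tokenizer; the trajectory, the 16-entry codebook, and the `jitter` measure are all invented for the example.

```python
import numpy as np

def quantize(x, codes):
    """Snap each value to its nearest codebook entry."""
    idx = np.abs(x[:, None] - codes[None, :]).argmin(axis=1)
    return codes[idx]

def jitter(x):
    """Largest frame-to-frame acceleration (second difference) magnitude."""
    return np.abs(np.diff(x, n=2)).max()

# One second of a smooth hypothetical joint angle sampled at 30 FPS.
t = np.linspace(0.0, 1.0, 31)
joint_angle = 0.2 * np.sin(2 * np.pi * t)

# A hypothetical 16-entry scalar codebook covering [-1, 1].
codes = np.linspace(-1.0, 1.0, 16)
step = codes[1] - codes[0]

discrete = quantize(joint_angle, codes)
max_err = np.abs(discrete - joint_angle).max()

print("max reconstruction error:", max_err)
print("jitter, continuous:", jitter(joint_angle))
print("jitter, quantized: ", jitter(discrete))
```

The quantized curve stays within half a codebook step of the original, yet it moves in sudden jumps, so its acceleration spikes far above that of the smooth curve. A continuous latent representation sidesteps this particular source of error.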

Method

LLaMo uses a MoT architecture to separate motion and language parameters, a causal continuous latent space for motion, and a flow-matching head for autoregressive generation, all built on a decoder-only Transformer backbone.
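How a flow-matching head could turn noise into one motion latent per autoregressive step is sketched below. This is a minimal illustration under strong assumptions: `toy_velocity` replaces the learned network with the closed-form velocity for a single-point target, the conditioning vector stands in for whatever the backbone would supply, and the names `sample_latent` and `stream` and the choice of 4 Euler steps are hypothetical. None of these details are confirmed by the summary.

```python
import numpy as np

def toy_velocity(x, t, cond):
    """Stand-in for a learned head v(x, t | cond).

    Assumption: the conditional target is the single point `cond`. On the
    linear path x_t = (1 - t) * x_0 + t * x_1, the exact velocity toward a
    point target is (target - x) / (1 - t).
    """
    return (cond - x) / (1.0 - t)

def sample_latent(cond, rng, steps=4):
    """Euler-integrate dx/dt = v from noise at t=0 to a latent at t=1."""
    x = rng.normal(size=cond.shape)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * toy_velocity(x, k * dt, cond)
    return x

def stream(conds, rng, steps=4):
    """Yield one latent per autoregressive step as soon as it is sampled."""
    for cond in conds:
        yield sample_latent(cond, rng, steps)

rng = np.random.default_rng(0)
conds = [np.full(3, float(i)) for i in range(5)]  # hypothetical conditioning vectors
latents = list(stream(conds, rng))
```

With this oracle velocity the Euler updates stay on the straight line from the noise sample to the target, so each latent lands on its conditioning point after the final step. A trained head would only approximate such a field, and a small number of integration steps is one plausible way to stay within a real-time budget (about 33 ms per frame at 30 FPS, if one latent covered one frame).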

In practice

Integrate motion generation into LLMs.
Develop real-time motion synthesis systems.
Improve zero-shot motion generation.

Topics

LLaMo
Motion-Language Models
Continuous Motion Generation
Mixture-of-Transformers
Unified Multimodal AI

Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.