Social Structure Matters in 3D Human-Human Interaction Generation
Summary
A new "Solo-to-Social" framework tackles realistic 3D human-human interaction (HHI) generation by explicitly modeling the social structure underlying an interaction. Traditional text-to-motion generation struggles with three aspects of HHI: complex phase progression, distinct actor roles, and inter-actor coordination. The authors found that large language models (LLMs) can infer interaction phases and partner-aware roles effectively, but fail when asked to generate dynamic, physically plausible motion directly. This observation motivates the "Think with LLM, Move with Motion Skill" paradigm. An LLM acts as the planner: it decomposes an interaction into phases and assigns partner-aware actor roles, turning implicit interaction semantics into motion-aligned social supervision. A motion executor then grounds this planned structure in coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. The authors report that this approach improves phase consistency, role alignment, and partner-aware coordination in generated 3D HHI.
Key takeaway
If you build 3D human-human interaction systems, consider decoupling planning from execution. LLM-based methods can capture social cues well yet still struggle to produce physically realistic motion. Under the "Think with LLM, Move with Motion Skill" paradigm, an LLM plans the high-level social structure, and a specialized motion executor grounds that plan in physically plausible, coordinated 3D movement. The reported benefit is better phase consistency and role alignment in the generated interactions.
Key insights
LLMs can plan social structure for 3D human-human interaction, but require a separate motion executor to generate physically plausible movements.
Principles
- Social structure governs HHI phase progression and actor coordination.
- LLMs excel at abstract planning, not direct motion generation.
- Adapting solo motion models can create coordinated HHI.
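
The last principle relies on LoRA, which freezes a pretrained weight matrix and learns only a small low-rank correction. The sketch below illustrates that general idea in plain Python. It is not the paper's implementation: the class name `LoRALinear`, the `alpha` scaling, and the choice of which layer to adapt are assumptions made for illustration, since the summary does not say which parts of the solo model are adapted.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of LoRA, not the paper's code.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # pretrained solo-model weight, kept frozen
        self.scale = alpha / rank
        # Standard LoRA initialization: A is small random, B is zero,
        # so the adapted layer starts out identical to the pretrained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        # Only A and B would be updated during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], rank=1)
output = layer.forward([1.0, 2.0, 3.0])
```

Because `B` starts at zero, the adapter is a no-op before training, so the solo model's motion skill is preserved at the start of fine-tuning and only the low-rank factors need to learn two-person behavior.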
Method
The "Think with LLM, Move with Motion Skill" paradigm has two stages. First, an LLM planner decomposes the interaction into phases and assigns a partner-aware role to each actor. Second, a motion executor realizes the plan as 3D motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning.
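
To make "ego-relative partner conditioning" concrete, the sketch below expresses a partner's ground-plane position and heading in one actor's own frame. The summary does not specify the paper's feature set, so the function name `partner_features`, the 2D ground-plane simplification, and the convention that yaw is a counter-clockwise angle measured from the +x axis are all assumptions.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def partner_features(ego_pos, ego_yaw, partner_pos, partner_yaw):
    """Describe the partner relative to the ego actor (hypothetical feature set).

    Positions are (x, y) points on the ground plane. Yaw is the facing
    direction, counter-clockwise from the +x axis (an assumed convention).
    """
    dx = partner_pos[0] - ego_pos[0]
    dy = partner_pos[1] - ego_pos[1]
    # Rotate the world-frame offset by -ego_yaw so the ego's facing
    # direction becomes the local "forward" axis.
    c, s = math.cos(-ego_yaw), math.sin(-ego_yaw)
    return {
        "forward": c * dx - s * dy,   # positive: partner is ahead of ego
        "left": s * dx + c * dy,      # positive: partner is to ego's left
        "distance": math.hypot(dx, dy),
        "rel_yaw": wrap_angle(partner_yaw - ego_yaw),
    }


# Ego stands at (1, 1) facing +y; the partner stands at (1, 3) facing -y.
features = partner_features((1.0, 1.0), math.pi / 2, (1.0, 3.0), -math.pi / 2)
```

Features of this kind do not change when the whole scene is translated or rotated, which is one plausible reason to condition each actor on its partner in an ego-relative frame rather than in world coordinates.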
In practice
- Decompose complex interactions into distinct phases.
- Use LLMs for high-level interaction planning (phases and roles), not for direct motion output.
- Adapt existing solo motion models for multi-person scenes.
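
The practices above can be sketched as a small planner-to-executor loop. Everything here is a hypothetical stand-in: the `Phase` structure, the hand-written handshake plan (used in place of a real LLM call), and the `stub_executor` that only records what it was conditioned on. It shows the control flow of phase-wise generation with previous-phase conditioning, not the paper's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One planned interaction phase (hypothetical schema)."""
    name: str
    roles: dict  # actor id -> partner-aware role description


# Hand-written stand-in for what an LLM planner might return for "two people shake hands".
HANDSHAKE_PLAN = [
    Phase("approach", {"A": "walks toward B", "B": "waits and turns to face A"}),
    Phase("shake", {"A": "extends right hand to B", "B": "grasps A's hand"}),
    Phase("separate", {"A": "steps back from B", "B": "lowers hand"}),
]


def validate_plan(phases, actors=("A", "B")):
    """Check that the plan is non-empty and every phase assigns a role to every actor."""
    if not phases:
        raise ValueError("plan has no phases")
    for phase in phases:
        missing = [actor for actor in actors if actor not in phase.roles]
        if missing:
            raise ValueError(f"phase {phase.name!r} lacks roles for {missing}")
    return phases


def stub_executor(phase, previous_clip):
    """Placeholder for a motion executor; returns metadata instead of motion."""
    return {
        "phase": phase.name,
        "conditioned_on": previous_clip["phase"] if previous_clip else None,
    }


def run_plan(phases, executor):
    """Generate phases in order, feeding each result into the next phase."""
    clips, previous = [], None
    for phase in phases:
        clip = executor(phase, previous)
        clips.append(clip)
        previous = clip  # previous-phase conditioning for the next step
    return clips


clips = run_plan(validate_plan(HANDSHAKE_PLAN), stub_executor)
```

Validating the plan before execution keeps malformed planner output (a missing actor role, an empty plan) from reaching the motion model, which is one practical advantage of separating the two stages.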
Topics
- 3D Human-Human Interaction
- Text-to-Motion Generation
- Social Structure Modeling
- Large Language Models
- Motion Planning
- LoRA
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.