Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

A novel three-stage framework is proposed to combine the strengths of continuous diffusion models, which excel at kinematic control, and discrete token-based generators, effective for semantic conditioning. Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. This design strategically applies coarse kinematic constraints during token planning and fine-grained constraints during diffusion-based control, preventing disruption to semantic token generation. On HumanML3D, the method significantly improves controllability and fidelity over MaskControl, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029, while using only one-sixth of the tokens. Notably, unlike prior methods, its fidelity improves under stronger kinematic constraints, reducing FID from 0.033 to 0.014.
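To make "compact single-layer tokens" concrete, here is a minimal sketch of generic nearest-neighbour vector quantization with a single codebook. The summary does not say how MoTok quantizes motion, so this is an illustrative assumption and not the authors' method. The names `tokenize` and `detokenize`, the 2-D codebook, and the latent values are all hypothetical. In the design described above, a diffusion decoder (not shown) would recover fine-grained motion from the coarse embeddings that `detokenize` returns.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry.

    One index per latent means a single token layer (no residual stack).
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def detokenize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the coarse embedding for each token.

    Quantization discards detail here; a separate decoder would have to
    restore it.
    """
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy codebook with three entries; real codebooks are learned.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical encoder outputs, one per motion segment.
latents = [[0.1, -0.2], [0.9, 0.2], [0.2, 0.7]]

tokens = tokenize(latents, codebook)
coarse = detokenize(tokens, codebook)
print(tokens)
print(coarse)
```

The gap between `latents` and `coarse` is the information a single token layer drops. The summary's point is that handing this recovery to a diffusion decoder lets the tokens stay compact and semantic.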

Key takeaway

A three-stage framework built around the diffusion-based MoTok tokenizer bridges semantic and kinematic conditions for motion generation. On HumanML3D, it improves controllability and fidelity over MaskControl, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029 while using only one-sixth of the tokens. Unlike prior methods, its fidelity improves under stronger kinematic constraints (FID falls from 0.033 to 0.014), which makes it well suited to high-precision motion synthesis.
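For a sense of scale, the HumanML3D figures quoted above can be restated as relative reductions. The short calculation below uses only the reported before and after values. The helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage by which a metric dropped, relative to its starting value."""
    return (before - after) / before * 100.0


# Values as reported against MaskControl on HumanML3D.
trajectory_error_drop = relative_reduction(0.72, 0.08)  # trajectory error, cm
fid_drop = relative_reduction(0.083, 0.029)             # FID

print(f"Trajectory error reduced by {trajectory_error_drop:.1f}%")
print(f"FID reduced by {fid_drop:.1f}%")
```

Under these numbers, trajectory error falls by roughly 89% and FID by roughly 65%.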

Topics

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.