Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models
Summary
Ex-Omni is an open-source framework that augments omni-modal large language models (OLLMs) with speech-accompanied 3D facial animation generation. It addresses the mismatch between the discrete, token-level semantics of an LLM and the dense, fine-grained temporal dynamics required for realistic 3D facial motion. Ex-Omni decouples semantic reasoning from temporal generation, uses discrete speech units as temporal scaffolding, and applies a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. Facial motion is generated non-autoregressively as ARKit-52 blendshape coefficients. The model is trained on InstructEx, a new dataset that includes a large-scale synthetic Speech-to-Face corpus generated with NVIDIA Audio2Face-3D. Experiments show that Ex-Omni performs competitively against existing OLLMs while producing stable, aligned speech and facial animation.
Key takeaway
For AI scientists and machine learning engineers building human-computer interaction systems, Ex-Omni offers a way to add expressive 3D facial animation to omni-modal LLMs. Its decoupled architecture and token-as-query gated fusion mechanism are worth considering when you need to align discrete semantic reasoning with dense temporal motion. The framework supports more natural and engaging digital avatars, and its synthetic data generation strategy may reduce the need for extensive real-world motion capture data.
Key insights
Ex-Omni integrates 3D facial animation into OLLMs by decoupling semantic reasoning from temporal generation using speech units and TQGF.
Principles
- Decouple semantic reasoning from temporal generation.
- Use discrete speech units for temporal scaffolding.
- Employ gated fusion for controlled semantic injection.
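The gated-fusion principle can be illustrated with a minimal Python sketch. It assumes one plausible reading of "token-as-query": each temporal token (for example, a speech-unit feature) acts as an attention query over the LLM's semantic states, and a scalar gate controls how much of the attended context is added back. The function name `gated_fusion`, the scalar gate, and the residual form are illustrative assumptions, not the formulation used in Ex-Omni.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(temporal_tokens, semantic_states, gate_weights, gate_bias=0.0):
    """Inject semantic context into each temporal token through a scalar gate.

    temporal_tokens: list of d-dim vectors, used as attention queries (assumption).
    semantic_states: list of d-dim vectors, used as keys and values (assumption).
    gate_weights:    2d-dim vector scoring the concatenation [token; context].
    """
    d = len(temporal_tokens[0])
    fused = []
    for q in temporal_tokens:
        # Scaled dot-product attention with the temporal token as the query.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in semantic_states])
        context = [
            sum(w * v[i] for w, v in zip(weights, semantic_states))
            for i in range(d)
        ]
        # Gate in (0, 1): how much attended semantic context to inject.
        g = sigmoid(dot(gate_weights, q + context) + gate_bias)
        # Residual update keeps the temporal token intact when the gate closes.
        fused.append([qi + g * ci for qi, ci in zip(q, context)])
    return fused
```

In this sketch a strongly negative `gate_bias` passes the temporal tokens through almost unchanged, while a strongly positive one adds the full attended context, which is one way to read "controlled semantic injection".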
Method
Ex-Omni maps speech/text to LLM tokens, performs LLM reasoning, then jointly generates speech units and ARKit-52 blendshape coefficients using TQGF. It trains in four stages: speech-text alignment, speech pre-training, speech-face co-training, and joint fine-tuning.
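The role of speech units as temporal scaffolding, and the non-autoregressive output of blendshape frames, can be sketched in Python as follows. The unit and frame rates, the helper names `frame_to_unit_index` and `decode_frames`, and the `frame_decoder` callable are hypothetical placeholders chosen for illustration; the source does not state the rates or decoder design used by Ex-Omni.

```python
NUM_BLENDSHAPES = 52  # ARKit defines 52 blendshape coefficients, each in [0, 1]


def frame_to_unit_index(num_units, unit_rate_hz, frame_rate_hz):
    """For each animation frame, return the index of the speech unit covering it.

    Rates are integers so the alignment uses exact integer arithmetic.
    """
    num_frames = num_units * frame_rate_hz // unit_rate_hz
    return [f * unit_rate_hz // frame_rate_hz for f in range(num_frames)]


def decode_frames(unit_features, frame_map, frame_decoder):
    """Non-autoregressive decoding sketch.

    Each frame is computed only from its aligned unit feature and never from
    previously generated frames, so all frames could be produced in parallel.
    """
    frames = []
    for idx in frame_map:
        coeffs = frame_decoder(unit_features[idx])
        # Clamp to the valid blendshape range.
        frames.append([min(1.0, max(0.0, c)) for c in coeffs])
    return frames


# Toy usage: 5 speech units at an assumed 25 Hz, animation at an assumed 30 fps.
frame_map = frame_to_unit_index(5, 25, 30)
toy_features = [-0.5, 0.2, 0.4, 0.9, 1.5]
toy_decoder = lambda feat: [feat] * NUM_BLENDSHAPES  # stand-in for a learned decoder
frames = decode_frames(toy_features, frame_map, toy_decoder)
```

Because the speech-unit sequence fixes the number and timing of frames up front, the decoder does not need to predict sequence length or step through time, which is the practical benefit of using the units as a scaffold.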
In practice
- Generate synchronized speech and 3D facial animation.
- Create expressive virtual characters and digital avatars.
- Augment OLLMs for natural human-computer interaction.
Topics
- Omni-modal LLMs
- 3D Facial Animation
- Speech-to-Face Generation
- ARKit Blendshapes
- Token-as-Query Gated Fusion
- InstructEx Dataset
- Digital Avatars
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.