Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

Ex-Omni is an open-source framework designed to augment omni-modal large language models (OLLMs) with speech-accompanied 3D facial animation generation. It addresses the difficulty of reconciling discrete, token-level LLM semantics with the dense, fine-grained temporal dynamics required for realistic 3D facial motion. Ex-Omni does this by decoupling semantic reasoning from temporal generation, using discrete speech units as temporal scaffolding, and applying a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. Facial motion is generated non-autoregressively as ARKit-52 blendshape coefficients. The framework is trained on InstructEx, a new dataset that includes a large-scale synthetic Speech-to-Face corpus generated with NVIDIA Audio2Face-3D. Experiments show that Ex-Omni performs competitively against existing OLLMs while producing stable, aligned speech and facial animation.

Key takeaway

For AI scientists and machine learning engineers building human-computer interaction systems, Ex-Omni offers one approach to integrating expressive 3D facial animation into omni-modal LLMs. Consider its decoupled architecture and token-as-query gated fusion mechanism when you need to align discrete semantic reasoning with dense temporal motion. The framework supports more natural and engaging digital avatars, and its synthetic data generation strategy may reduce the need for extensive real-world motion capture data.

Key insights

Ex-Omni integrates 3D facial animation into OLLMs by decoupling semantic reasoning from temporal generation using speech units and TQGF.

Principles

Decouple semantic reasoning from temporal generation.
Use discrete speech units for temporal scaffolding.
Employ gated fusion for controlled semantic injection.
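
The summary does not give the exact TQGF formulation, so the following is only a minimal sketch of one plausible reading of "token-as-query gated fusion": each temporal token (for example, a speech-unit embedding) acts as an attention query over the LLM's semantic states, and a sigmoid gate controls how much of the attended context is injected. The function name gated_fusion, the scalar gate, and the gate_weights and gate_bias parameters are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(token_queries, semantic_states, gate_weights, gate_bias=0.0):
    """Hypothetical gated fusion: tokens query the semantic states.

    token_queries:   list of T vectors (temporal side, e.g. speech units)
    semantic_states: list of S vectors (LLM hidden states), same dimension
    gate_weights:    vector of length 2 * dim applied to [query; context]
    """
    dim = len(token_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in token_queries:
        # Scaled dot-product attention with the token as the query.
        attn = softmax([dot(q, s) * scale for s in semantic_states])
        context = [
            sum(a * s[d] for a, s in zip(attn, semantic_states))
            for d in range(dim)
        ]
        # Scalar gate in (0, 1) decides how much semantics to inject.
        gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, q + context) + gate_bias)))
        fused.append([qd + gate * cd for qd, cd in zip(q, context)])
    return fused


# With zero gate weights and zero bias the gate is exactly 0.5.
print(gated_fusion([[1.0, 0.0]], [[2.0, 4.0]], [0.0, 0.0, 0.0, 0.0]))
# [[2.0, 2.0]]
```

The gate is what makes the injection "controlled" in this sketch: pushing gate_bias strongly negative leaves the temporal tokens almost untouched, while a strongly positive bias adds the full attended context.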

Method

Ex-Omni maps speech/text to LLM tokens, performs LLM reasoning, then jointly generates speech units and ARKit-52 blendshape coefficients using TQGF. It trains in four stages: speech-text alignment, speech pre-training, speech-face co-training, and joint fine-tuning.
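
To make "non-autoregressive" generation of ARKit-52 coefficients concrete, here is a toy sketch under stated assumptions. Each output frame is computed only from its own input feature and never from previously generated frames, so all frames could be produced in parallel. The per-frame linear projection, the decode_frames name, and the weights and bias parameters are hypothetical; the actual decoder is presumably far richer. The sigmoid keeps every coefficient in [0, 1], the range ARKit uses for blendshape values.

```python
import math

NUM_BLENDSHAPES = 52  # ARKit exposes 52 blendshape coefficients per frame


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_frames(frame_features, weights, bias):
    """Map each per-frame feature vector to 52 coefficients in [0, 1].

    frame_features: list of T feature vectors (e.g. fused speech-unit states)
    weights:        NUM_BLENDSHAPES rows, each the length of a feature vector
    bias:           NUM_BLENDSHAPES floats

    No frame reads an earlier output frame, which is what makes this
    toy decoder non-autoregressive.
    """
    frames = []
    for feat in frame_features:
        frame = [
            sigmoid(sum(w * f for w, f in zip(row, feat)) + b)
            for row, b in zip(weights, bias)
        ]
        frames.append(frame)
    return frames


# Three frames of 2-dimensional features with an all-zero projection.
features = [[0.1, -0.3], [0.5, 0.2], [0.0, 0.9]]
weights = [[0.0, 0.0] for _ in range(NUM_BLENDSHAPES)]
bias = [0.0] * NUM_BLENDSHAPES
animation = decode_frames(features, weights, bias)
print(len(animation), len(animation[0]))  # 3 52
```

An autoregressive decoder would instead feed each generated frame back in as input to the next step, which makes generation sequential in the number of frames.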

In practice

Generate synchronized speech and 3D facial animation.
Create expressive virtual characters and digital avatars.
Augment OLLMs for natural human-computer interaction.

Topics

Omni-modal LLMs
3D Facial Animation
Speech-to-Face Generation
ARKit Blendshapes
Token-as-Query Gated Fusion
InstructEx Dataset
Digital Avatars

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.